Applied Soft Computing 12 (2012) 238–254. doi:10.1016/j.asoc.2011.08.049
A genetic algorithm-based rule extraction system

Bikash Kanti Sarkar a,∗, Shib Sankar Sana b, Kripasindhu Chaudhuri c

a Department of Information Technology, B.I.T., Mesra, Ranchi 835 215, Jharkhand, India
b Department of Mathematics, Bhangar Mahavidyalaya (C.U.), Bhangar 743 502, 24-Pgs(S), W.B., India
c Department of Mathematics, Jadavpur University, Kolkata 32, India
Article history: Received 15 July 2010; Received in revised form 7 June 2011; Accepted 21 August 2011; Available online 3 September 2011.

Keywords: Classification; Accuracy; C4.5; Genetic algorithm; Hybrid system
Abstract: Individual classifiers are widely used to predict unknown objects. However, they are usually domain specific and do not scale well to data sets that are very large, high-dimensional, or have an imbalanced class distribution. This article introduces an accuracy-based learning system called DTGA (decision tree and genetic algorithm) that aims to improve prediction accuracy over any classification problem irrespective of domain, size, dimensionality and class distribution. More specifically, the proposed system consists of two rule-inducing phases. In the first phase, a base classifier, C4.5 (a decision tree based rule inducer), is used to produce rules from the training data set, whereas a GA (genetic algorithm) in the next phase refines them with the aim of providing more accurate and higher-performance rules for prediction. The system has been compared with competent non-GA based systems: neural network, Naïve Bayes, a rule-based classifier using rough set theory, and C4.5 (i.e., the base classifier of DTGA), on a number of benchmark datasets collected from the UCI (University of California at Irvine) machine learning repository. Empirical results demonstrate that the proposed hybrid approach provides marked improvement in a number of cases. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Over the last decades, one can observe a growing research interest in the fields of machine learning and knowledge discovery [13]. The amount of data stored in databases continues to grow rapidly. Intuitively, this large amount of stored data contains valuable hidden knowledge, which could be used to improve the decision-making process of an organization. For instance, data about previous sales might contain interesting relationships between products and customers, and the discovery of such relationships can be very useful for increasing the sales of a company. However, the number of human data analysts grows at a much smaller rate than the amount of stored data. Thus, there is a clear need for (semi-)automatic methods for extracting knowledge from data, and many data mining algorithms have been developed for extracting knowledge from large databases [5]. In practice, one of the main tasks considered in knowledge discovery is supervised classification, where the learning process is provided with a set of training examples of target concepts (i.e., classes). Specifically, each example is described by a finite number of non-target attributes and a target (class) attribute, and the goal of learning is to discover a rule or a function (in machine
learning, often called a hypothesis) that maps such descriptions into those classes [13,15]. Furthermore, an algorithm that combines a knowledge representation (learned from some training set) with a strategy for its usage forms a classifier, which can be used to predict the classes of newly arriving objects. A typical criterion used to evaluate a classifier's performance is classification accuracy, i.e., the percentage of correctly classified examples. Several single classifiers have been proposed over the years for inducing various knowledge representations (a review is available, for example, in [9,11,13,20]). Although most of these are very effective on particular data sets, they do not always lead to satisfactory classification accuracy in more complex and difficult cases. For instance, theoretical and empirical comparative studies [4,14] have confirmed that there is no single best algorithm for all datasets. One of the main reasons behind such obstacles is the nature of the data set. For clarity, if a data set is linear but a non-linear learning algorithm is applied, or the reverse, then prediction accuracy usually suffers; that is, performance depends mainly on the nature of the data set. Briefly, different algorithms have different strengths and weaknesses, a fact which makes some of them more suitable for certain problems than others. So, to overcome these obstacles, traditional methods have recently been combined among themselves or with GA [37] to tackle classification problems. Such systems are known under several names, such as multiple classifiers, ensembles, committees or classifier fusion [2,4,9,23]. In general, such a
system takes a set of the same or different learning algorithms and applies them, respectively, to different or the same data sets of the same problem in order to induce knowledge. Their predictions are then combined to form an integrated system for making decisions more accurately. A number of algorithms have been designed to induce knowledge based on this strategy; for a review, one may see [3,6–8,10,12,16,20–22]. However, in recent years there has been an increasing interest in the use of evolutionary algorithms to learn knowledge. A short review is presented here. GAs have been successfully applied to learning tasks in different domains such as chemical process control [28], financial classification [29], manufacturing scheduling [31] and robot control [30]. GP (genetic programming, an extension of GA) [32] has been used to evolve a population of fuzzy rule sets. Ester Bernado et al. [27] have provided an accuracy-based GA approach, UCS. Sarkar et al. [17,19] have studied an accuracy-based learning classification system combining C4.5 [18] and GA. In order to forecast the future sales of a printed circuit board factory more accurately, Chang et al. [33] have proposed a hybrid model in which a GA is utilized to optimize the Fuzzy Rule Base (FRB) adopted by the Self-Organizing Map (SOM) neural network. The GA part of the hybrid model [34] has been employed to find an optimal structuring element for classifying garment defect types. Faraoun and Boukelif [35] made an attempt to show the use of a new GP classification approach for performing network intrusion detection. Wong et al. [36] presented a decision support tool combining an expert system and the Takagi-Sugeno Fuzzy Neural Network (TSFNN) for fashion coordination; they have also shown that the GA plays an important role in reducing the coordination rules and the training time for TSFNN. Albert et al. [50] investigate the capabilities of evolutionary on-line rule-based systems, also called learning classifier systems (LCSs), for extracting knowledge from imbalanced data. Finally, Urbanowicz and Moore [51] have given a complete roadmap on LCSs. Looking into past research on machine learning and referring to the article [51], we may comment that most classification systems are designed for solving specific types of problems. Some are specialized to solve multi-class learning problems; a few are designed for class-imbalanced problems. But a learning system concentrating on all these factors together is rarely designed. Again, based on an analysis of the existing GA-based systems, implementation and computation costs are identified as their common issues. Besides, they usually struggle to improve performance on volumetric data because of the growing population size. With these points in mind, a hybrid GA-based learning system (named DTGA) is proposed in this study to improve prediction accuracy over any classification problem irrespective of domain, size, dimensionality and class distribution. The combining approach is very simple and straightforward to implement.
Structurally, the system consists of three phases: (i) the first phase applies the C4.5 rule induction algorithm to extract a base set of IF-THEN rules (R) from a known labeled training data set (say Etrain); (ii) the Interface [26] is used in the next phase to conveniently tackle the interpretability problem of the rules for applying the GA; and finally (iii) the suggested GA is employed on R to filter out informative rules based on the same training set Etrain. More specifically, in the GA phase, the worst rules in R are expected to be replaced with better new rules (in terms of higher accuracy and lower error rate) of classes identical to the respective worst rules. For instance, an existing worst rule of a particular class (say c1) is replaced with a better new rule of class c1. This mechanism specifically aims to improve the classification performance of the system over any data set (even an imbalanced one). In other words, the better new rules generated in the GA phase are expected to survive, removing the worst rules of the respective classes. It is, indeed, based on the steady-state approach of GA. Clearly, the numbers of rules of the respective classes present in the original rule set R for any classification problem remain unchanged
in its optimized version too; only better rules are copied in place of the worst rules to improve the overall classification performance of the rule set. In addition, the proposed system also has the ability to improve performance on volumetric data because it begins with a smaller (i.e., the population size is comparatively small) but informative rule set. Note that past research states that if the population size increases, then performance degrades, since the waiting time for effective crossover becomes too long and there is insufficient juxtaposition of building blocks prior to convergence. Further, a system performs poorly over large data when GA alone is used, since there is little variation in the population for crossover. However, the GA in our system starts with a better population produced by C4.5. Regarding C4.5, it is true that the rules learned by C4.5 are collectively visualized in the form of a decision tree, and their representation is quite precise and simple to interpret in comparison to other learners. Moreover, such rules are comparatively convenient for applying a GA. Also, constructing decision trees is computationally inexpensive, even when the training set size is very large. However, it is difficult to obtain the best decision tree from information gain alone, using such a greedy heuristic strategy. In fact, constructing the best decision tree (in terms of number of nodes) is an NP-complete problem [49]; the defect is due to the searching strategy. Besides, decision tree rule induction algorithms like C4.5 usually discover rules which perform better on the training set than on the test set, due to the presence of deeper nodes in the tree. Actually, the information gained by deeper nodes is nominal and reflects the training set too strongly (known as overfitting). In addition, such an overfitted model certainly requires more storage space and a longer time to execute. Of course, pruning strategies are commonly applied to stop the growth of the tree (i.e., to correct overfitting). But pruning a decision tree certainly causes some examples from the training set to be misclassified. More importantly, the misclassification rate (i.e., error rate) increases especially on imbalanced data sets (i.e., data sets with uneven class distributions). It may be noted here that decision tree based learners are normally biased towards majority classes [40,41]. In fact, the construction strategy of a decision tree prioritizes the number of examples with particular class values, and each time one best attribute is selected as a node of the growing tree based on a probabilistic measure. Consequently, the final constructed decision tree tends to classify all samples into the major classes, and accordingly the imbalance problem seriously affects the performance of the decision tree. Interestingly, the GA has the specific capability to solve the problems suffered by C4.5; it can improve and increase the ability of the classification algorithm. So, to considerably overcome the drawbacks of C4.5, we prefer here to let the GA refine the rule set (R) learned by C4.5 in the subsequent phase while keeping the number of rules unchanged. More clearly, the ultimate goal of this phase is to evolve maximally accurate (i.e., informative) rules and replace the worst ones of R with better new ones to achieve high classification accuracy. After all, the contribution of the proposed sampling technique (as described in Section 2.1) is also not negligible here. Truly speaking, this sampling strategy acts to enrich the population (generated by the base classifier) for the GA.
To show the strength of the proposed learning system, experiments are performed on a number of benchmark data sets [1]. The attributes in the selected data sets contain continuous values, but they are discretized by the SPID4.7 discretizer [25]. Apart from the base learner C4.5 (Decision Tree), the performance of the present system is compared with three other competent learners (each belonging to a distinct family), namely, Neural Network [24] (Artificial Neural Network), Naïve Bayes [24] (Bayesian Network) and a rule-based classifier on rough set theory [44], over the selected data sets. Let us note here that there exist many varieties of neural network. The approach selected here is based on input, hidden and output layers. The input layer has n neurons, one for each input
parameter, whereas the output layer has k neurons, one for each class. The hidden layer has (n + k)/2 neurons, and each neuron uses a sigmoid activation function. Apparently, the idea of the constructed system is quite similar to the methods in [17,19], but the present investigation differs from the other two in the following respects: (i) the number of selected data sets and the splitting technique, (ii) the fitness function and the encoding and decoding techniques adopted in the GA, and finally (iii) the number of classifiers used for comparing performance and the evaluation strategies.

The paper is organized as follows. Section 1 discusses the importance of machine learning algorithms, their drawbacks and recent trends. Section 2 introduces the Learning Classifier System (LCS) [38] and the proposed DTGA approach, including the suggested sampling technique. Section 3 briefly describes each of the other learners used in this study, whereas the experimental design and analysis of the results are discussed in Section 4. Finally, the conclusion is summarized in Section 5.

2. DTGA as LCS

J. Holland presented the first architecture of LCS in 1975. Since then, a number of investigations on its architecture have been performed to solve classification problems. In 1976, it was split into two categories depending upon where the GA acts: (i) Pittsburgh-style LCS, in which a population of separate rule sets is considered and the GA recombines and reproduces the best of these rule sets; and (ii) Michigan-style LCS, which consists of only a single population in which the GA focuses on selecting the best rules within the rule set. UCS (an accuracy-based classifier system) is an example of Michigan-style LCS, specifically designed for the supervised environment. Basically, it keeps the principal features of XCS (a best action map based classifier system) [39] but changes the way in which accuracy is computed. In particular, the GA in XCS is run preferably on the action sets, whereas the GA in UCS is inherited from XCS and performs crossover and mutation, selecting two classifiers from the correct set [C] with probability proportional to fitness. The resulting offspring are then inserted in the population, leaving out the incorrect rules. Structurally, many of the design criteria are optimized in UCS. As per this discussion of LCS styles, we may comment that the present investigation DTGA is also based on Michigan-style LCS and consists of three phases (as already pointed out above). Fig. 1 illustrates the conceptual model of the proposed learning system. The procedures involved in the phases are also discussed in this section. But before discussing them in detail, we must first describe the proposed sampling strategy adopted in the current model.

2.1. Proposed data-splitting technique

Data splitting is an important issue in machine learning. It is a very difficult task to identify how many instances are sufficient to gain knowledge for making good decisions. Further, a data set may be imbalanced, i.e., the majority class of the data set may heavily dominate the minority class; this sort of data is commonly found in the medical domain. Experimentally, the imbalanced nature of a data set is often reported as an obstacle to the induction of good classifiers. Some common data-sampling techniques such as random over-sampling, random under-sampling and the synthetic minority over-sampling technique (SMOTE) are usually followed to handle imbalanced data sets. A brief description of each is given below.
• Random over-sampling is a non-heuristic method that aims to balance class distribution through the random replication of minority class examples.
• Random under-sampling is also a non-heuristic method that aims to balance class distribution through the random elimination of majority class examples.
• Synthetic minority over-sampling technique (SMOTE): SMOTE is an over-sampling method that creates new minority class instances by interpolating several minority class examples that lie nearby in the feature space. It is said that this method avoids overfitting by creating rather than replicating instances of the minority class.

Each of the above groups has some drawbacks: the computational load increases due to over-sampling, whereas under-sampling does not take into account all available training data, which corresponds to a loss of available information. In fact, there is no good solution for such a problem, and so this research problem is still open. However, keeping the imbalance problem of data sets in mind and concentrating on the learning strategy of the base classifier C4.5 of DTGA, we suggest here a new data sampling technique, presented below.

In this approach, approximately 30% of the examples from each data set (say E) are selected as the training set, labeled Etrain, maintaining an almost equal proportion in class distribution over the data. In other words, attention is paid here to include 30% of the examples of each class value in Etrain. For instance, suppose there are 3 class values (say c1, c2 and c3) in a classification problem P with 150 examples in total, and the numbers of examples of class types c1, c2 and c3 are respectively 30, 45 and 75. Then 9, 14 and 23 examples of class types c1, c2 and c3, respectively, are to be included in Etrain by random selection from E. More clearly, 3 examples (each of one distinct class value) are included in Etrain in the first pass. However, as soon as one example is included in Etrain, it is immediately crossed out from E. Similarly, in the next pass, another 3 distinct examples are included in Etrain and then crossed out from E, and so on, until Etrain contains at least 30% of the total. This hold-out strategy is, in fact, sampling without replacement. Let us note here that we apply the ceil function to place 30% of the examples of each class in Etrain, and the inclusion of examples of any class label is stopped immediately when 30% of the examples of that class type are already in Etrain. For instance, (30 × 45)/100 = 13.5, so ceil(13.5) = 14 examples of class c2 (out of 45) are included in Etrain. Finally, when Etrain is constructed, the remainder of E (i.e., E − Etrain) is treated as the test set and labeled Etest. Now, Etrain and Etest are given to the learning algorithms as training set and test set, respectively. However, the contents of Etrain and Etest vary at each iteration, while their sizes remain unchanged. It is, indeed, a sampling technique based on the natural distribution of classes. However, the approach has the capability to control the bias towards majority classes in the training set, since a smaller percentage of each class value is selected. A short illustrative sketch of this splitting procedure is given after the list below. Let us highlight the expected strengths of this sampling technique as follows:

• First of all, such a sampling strategy provides a smaller training set with an almost uniform proportion in class distribution over the data (i.e., no class value is ignored). Now, as per the concepts of information theory, the more uniform the probability distribution, the greater its information.
In other words, if the training set contains examples of all classes in uniform proportion, then it should be much more informative than one that ignores the rare classes. Again, entropy (i.e., degree of doubt) relates to information with a negative sign only, implying that the amount of doubt (impurity) is reduced by gaining/conveying the same amount of information. Hence, looking at the entropy function (see Section 2.2), we expect that an overall balanced rule set (i.e., informative rules covering all classes) will be generated from the training set by the base classifier C4.5 of the proposed system at the very
Fig. 1. Discovering optimized rules by DTGA.
beginning of the investigation. Usually, C4.5 generates rules, each of which covers many training examples. However, if rules covering class values with few training examples are ignored, then classification accuracy degrades significantly; our sampling strategy tries to prevent this. Indirectly, it pays attention to overcoming the imbalance problem in the data set to some extent. More importantly, if the training set covers more than 30% of the entire data set while maintaining a uniform proportion in class distribution, then the probability of minority-class examples being available in the training set decreases due to their rareness in the entire set. This again causes bias towards the majority classes. That is why we have chosen here 30% of the instances for the training set and the remaining 70% for the test set.
• Moreover, such a balanced rule set tends to consist of a small number of rules compared to a rule set generated from a training set containing more than 30% of the examples of the entire data set. Accordingly, it becomes convenient to apply the GA especially on volumetric data sets, since a GA usually struggles to improve accuracy over a large data set due to the large population size.
• After all, such a smaller training set reduces the bias of any learning model towards the known data, making it more reliable on unseen data, since larger training data increases the bias towards the known data.
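The following Python sketch illustrates the class-wise 30–70% splitting procedure of Section 2.1. It is an illustrative reconstruction under our own assumptions, not the authors' code; names such as `stratified_30_70_split` are hypothetical.

```python
import math
import random
from collections import defaultdict

def stratified_30_70_split(examples, class_index=-1, train_fraction=0.30, seed=None):
    """Split 'examples' (a list of attribute-value lists) into (E_train, E_test),
    taking ceil(30%) of the examples of every class value, sampled without
    replacement, so that the class distribution of the data set is preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[class_index]].append(ex)

    train, test = [], []
    for cls, members in by_class.items():
        rng.shuffle(members)                              # random selection within the class
        quota = math.ceil(train_fraction * len(members))  # ceil(30% of this class)
        train.extend(members[:quota])                     # goes to E_train
        test.extend(members[quota:])                      # the rest forms E_test
    return train, test

# Usage with the example from the text: 150 examples, classes c1/c2/c3 with 30/45/75 members.
if __name__ == "__main__":
    data = [["x", "c1"]] * 30 + [["x", "c2"]] * 45 + [["x", "c3"]] * 75
    e_train, e_test = stratified_30_70_split(data, seed=1)
    # Expected quotas: 9, 14 and 23 examples of c1, c2 and c3, i.e. 46 training examples.
    print(len(e_train), len(e_test))   # 46 104
```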
The strengths of the proposed sampling technique are shown in Section 4 and Appendix D.

2.2. C4.5: a decision tree based classifier

A decision tree is a classifier expressed as a recursive partition of the instance space. It consists of nodes that form a rooted tree, meaning it is a directed tree with a node called the "root" that has no incoming edges. All other nodes have exactly one incoming edge but may have zero or more outgoing edges. By definition, a node with outgoing edges is referred to as an internal or test node. In particular, each internal node represents an attribute, and each outgoing edge from an internal node represents one value (known as a test value) of that attribute. The other nodes are called leaves (also known as terminal or decision nodes). Basically, each internal node in a decision tree splits the instance space into two or more sub-spaces according to a certain decision function of the input attribute values, and each leaf is assigned to one class representing the most appropriate target value. Obviously, instances are classified by navigating them from the root (the most informative node) of the tree down to a leaf, according to the outcomes of the tests along the path. Concisely, a decision tree is a tree in which the nodes represent attributes of a data set. The attributes are chosen according to some criterion, such as information gain. The information gain indicates how informative an attribute is with respect to the classification task, using its entropy: the higher the variability of the attribute values, the higher its information gain. The attributes with the higher information gain are chosen to create the next nodes, and the leaves of the tree represent the classes. Hence, a top-down approach is adopted here. C4.5 is a good example of a decision tree based rule induction algorithm that uses entropy computation to determine the most relevant attribute at each node of the tree. In fact, entropy is used to measure how informative a node is, and it is defined as

Entropy(S) = − Σ_{i=1}^{c} pi log2 pi,

where S is the collection of learning examples and pi is the proportion of S belonging to class i among the c classes. In the C4.5 approach, three distinct steps can be identified: (i) constructing a decision tree, (ii) pruning the tree and (iii) rule induction based on the pruned tree. Each step operates at a different level. In the first step, the C4.5u (unpruned) algorithm constructs a decision tree. During this process, the conjuncts of the rule set are chosen one by one, considering the information gain (or gain ratio) of the possible attributes. This recursive partition method begins
with bulk data, because the first branching is based on all the learning material. But it becomes more and more local, as each choice for branching the decision tree depends on a local comparison of the entropy of the remaining learning examples. In the second step, the C4.5p (pruned) algorithm prunes the tree generated by C4.5u. The resulting tree of C4.5u is often very complex and over-fits the data, so the idea is to remove, through C4.5p, the parts of the tree that do not contribute to classification accuracy on new material. Further, error estimates for leaves and sub-trees are computed to remove or combine parts of the tree. Finally, the C4.5r (rule induction) algorithm accepts the decision tree obtained by C4.5p, and each path from the top node to a leaf of the modified tree yields a potential decision rule expressed in simple IF-THEN form. In fact, Quinlan developed the ID3 rule induction algorithm (based on decision trees) in 1986 and the C4.5 induction software package in 1993. C4.5 is derived from ID3 but improves upon it, as it handles continuous as well as missing data and follows pruning strategies to remove the parts of the tree that do not contribute to classification accuracy.

2.3. Interface

In general, the decision rules (i.e., knowledge) induced from examples are represented as logical expressions of the form IF (conditions) THEN (decision class), where the conditions in each rule are conjunctions of elementary tests on values of attributes, and the decision part indicates the assignment of an object (that satisfies the condition part) to a given decision class. Thus, each rule can be viewed as antecedent → consequent, where the antecedent part consists of conjuncts (i.e., conditions) and the consequent is the decision (i.e., action). In the present approach, the C4.5 rule induction algorithm adopted in the first phase extracts rules from the training examples. Each extracted rule is close to the IF-THEN form, as shown through the example of the golf-playing problem presented in Appendix B. Obviously, rules in such an IF-THEN format create an interpretability problem for applying the GA. So, to make them more convenient for the GA, we prefer such rules in a tabular-like form (as shown in the last part of Appendix B). In particular, the Interface performs the task of this representation of the rules, eliminating the IF-THEN part. The discrete values below the names of the attributes in such a tabular form specify their respective values. A '*' symbol appearing in a rule simply indicates the absence of a condition on the attribute corresponding to the position of '*' (i.e., the attribute corresponding to '*' has no importance in that rule). The '*' value for an attribute (say Ai) in a rule is positioned according to Ai's position in the data set. Again, all the non-target attributes, irrespective of their presence or absence in a rule, are strictly retained in every rule with a view to simpler implementation.
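As a minimal illustration of this tabular representation (our own sketch, not code from the paper; the names `Rule`, `covers` and `predicts_correctly` are hypothetical), a rule can be held as a fixed-length list of attribute values with None standing for the '*' don't-care symbol:

```python
from typing import List, Optional

# A rule of the tabular form [1 * 2 1 * 3 | class 0] becomes:
Rule = List[Optional[int]]          # the last position holds the class value
example_rule: Rule = [1, None, 2, 1, None, 3, 0]

def covers(rule: Rule, example: List[int]) -> bool:
    """True if every non-'*' antecedent value of the rule matches the example
    (the class attribute in the last position is ignored here)."""
    return all(r is None or r == e for r, e in zip(rule[:-1], example[:-1]))

def predicts_correctly(rule: Rule, example: List[int]) -> bool:
    """True if the rule covers the example and also assigns the right class."""
    return covers(rule, example) and rule[-1] == example[-1]

# e.g. the rule above covers [1, 5, 2, 1, 9, 3, 0] and classifies it correctly.
```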
2.4. Proposed GA based sub-system

This section begins with a short introduction to genetic algorithms and then discusses in detail each genetic component proposed in this article. Genetic algorithms are popular search algorithms based on the mechanics of natural selection and evolution; they start with an initial population of solutions to a problem. The initial population is then manipulated using various genetic operators to produce a new population, with the aim of optimizing the solution(s) of the problem. The frequently used genetic operators are selection, crossover, mutation, dropping condition, etc. Normally, two parents are chosen randomly from a population, and two new children are generated by subsequently applying crossover, mutation or other operators. In practice, encoding of the parents is done before the crossover operation. There are several ways of encoding parents; for example, one can encode individuals directly as integer/real numbers or as arrays of integers/decimal numbers, with each position representing some particular aspect of the solution. However, solutions are normally encoded by an equivalent binary representation. Among the possible genetic operators, crossover is an important one. It merges the genetic information of two existing individuals (parents), picked by the selection operator, and creates two new individuals (children) called offspring. There are numerous ways to perform the crossover operation, but the simplest is single-point crossover, where two chosen individuals are cut at a randomly selected point within the length of the parents; the tails (the parts after the cutting point) are swapped, leading to two new individuals. In general, if the representation of the individuals is binary (0, 1), then in the mutation operation a zero ('0') is changed to a one ('1') and vice versa. A simple GA treats mutation only as a secondary operator for restoration of genetic material; however, it helps to find the global optimal solution of the problem by searching new areas. Further, the fitness function drives the evolution towards optimization by calculating a fitness score for each individual in the population; this value evaluates the performance of each individual. Concisely, the whole process of evolving from one population to the next is called a generation. The process continues until a predefined termination criterion (for instance, the achievement of a performance target or a certain number of fruitful generations) has been met.

2.4.1. Fitness function

The fitness function is essentially the objective function for the problem. It provides a means of evaluating the search solutions and also controls the selection process. It is well accepted that the fitness function is the only problem-dependent part of a GA. For classification, we can consider factors such as prediction accuracy, error (i.e., misclassification) rate, the imbalanced class problem, etc. Keeping these points in mind, we propose here a new fitness formula:

f(ri) = (n − m)/(n + m) + n/(m + k)     (1)

where ri represents the ith decision rule; n is the number of training examples satisfying all the conditions in the antecedent (A) as well as the consequent (C) of the rule ri (i.e., correct classifications); m is the number of training examples which satisfy all the conditions in the antecedent (A) but not the consequent (C) of the rule ri (i.e., misclassifications); and k is a predefined positive constant. For instance, suppose n = 8 and m = 2 for a rule ri, and assume that k = 4. Then the fitness value of the rule ri is

f(ri) = (8 − 2)/(8 + 2) + 8/(2 + 4) ≈ 1.93
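A hedged Python sketch of Eq. (1) follows. It is our illustration only; the function name `rule_fitness` and the reuse of the list-based rule representation from the sketch in Section 2.3 are assumptions, not part of the paper.

```python
def rule_fitness(rule, train_set, k=4):
    """Fitness of Eq. (1): f = (n - m)/(n + m) + n/(m + k), where n counts training
    examples matched and correctly classified by the rule and m counts examples
    matched by the antecedent but wrongly classified. 'rule' and each example are
    fixed-length lists whose last element is the class; None is the '*' wildcard."""
    n = m = 0
    for ex in train_set:
        if all(r is None or r == e for r, e in zip(rule[:-1], ex[:-1])):
            if rule[-1] == ex[-1]:
                n += 1          # covered and correctly classified
            else:
                m += 1          # covered but misclassified
    if n + m == 0:
        return 0.0              # rule covers nothing on this training set
    return (n - m) / (n + m) + n / (m + k)

# Counts used in the tie-breaking example discussed below (with k = 4):
# r1: n=16, m=1  ->  15/17 + 16/5 = 4.08
# r2: n=20, m=5  ->  15/25 + 20/9 = 2.82
```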
Ideally, this evaluation function reduces collisions among the fitness scores of the rules and ensures the survival of rules with higher classification accuracy but lower error rate, i.e., it removes noisy rules. In fact, the first part of the function plays a major role in satisfying that task. A brief illustration of the presented function is given below to justify its soundness. Suppose that two rules r1 and r2 correctly classify 16 and 20 examples respectively, whereas they misclassify, say, 1 and 5 examples. Clearly, n = 16, m = 1 for r1, whereas n = 20, m = 5 for r2. In such a case, it is very difficult to choose the better of these two rules by the fitness function suggested in [19],
since both rules return an identical score under that evaluation function. However, the present fitness function given in Eq. (1) easily resolves such a tie by yielding fitness scores of 4.08 and 2.82 for r1 and r2, respectively. It can be pointed out here that identical values from the first part of this function for two different rules are rare, unless the values of m and n are multiplied by the same factor or the value of m is zero. If this happens, then the second part of the function plays a vital role in breaking the tie. Note that a positive constant parameter k is introduced here because the case m = 0 may occur. Thus, the present fitness function reduces the chance of collisions in the fitness scores and strongly favours the survival of the appropriate rule on the basis of a higher fitness score. In fact, the two parts of the present evaluation function together fulfil this role. In addition, it does not fail to discover better rules of minority classes, due to the presence of a balanced rule set.

2.4.2. Encoding and decoding strategy

In this phase, the discretized values in the rules (solutions) are used directly. No other representations such as binary coding (and the reverse) are applied here. The main reason behind such a strategy is that the attribute values present in the conjuncts of the rules have been significantly selected by C4.5, and we are not interested in losing these values. Only better new individuals, fit for the test (unseen) data set, are expected to be generated by recombining the conjuncts. Clearly, such a strategy is very simple and less time consuming, as individuals are manipulated directly rather than as bit strings.

2.4.3. Genetic operators

2.4.3.1. Selection and crossover. In our experiment, each rule of a classification problem is considered to be of uniform length, assuming unit length for each attribute. For example, if a classification problem P consists of r attributes (including the target as well as the non-target attributes), then the length (L) of each rule of that problem is considered to be r (since all the non-target attributes of the problem, irrespective of their presence or absence in a rule, are strictly retained in the rule for simpler implementation, and unit length is assigned to each attribute). In this study, the selection of parents for performing crossover is done at random. The rules in the rule set (i.e., population) are numbered 1, 2, . . ., n. Two parents, say p1 and p2, are picked randomly from the current rule set and placed into a mating pool. Next, for the crossover operation, two distinct points (xi, i = 1, 2) are chosen randomly within 1 < xi < L. Finally, the heads and tails (the parts before and after the cutting points, respectively) of the parents are swapped, leading to two new individuals, say O1 and O2.

2.4.4. Proposed genetic-based algorithm

The proposed algorithm begins with an overall balanced rule set R learned by the base classifier C4.5 applied to a set of training examples (Etrain). The training set is, in fact, chosen by the strategy discussed in Section 2.1, and the same training set Etrain is used for our GA too. The GA proceeds by choosing chromosomes (rules) to serve as parents and possibly replacing rules with new chromosomes based on fitness score. Obviously, the quality of each rule ri ∈ R is measured by Eq. (1) (specified in Section 2.4.1).
Also, in order to find the overall fitness score of the new rule set while replacing the worst one(s) of R with the generated children, the approach uses a temporary data structure RT (structurally similar to R). Let us note here that, each time, all the rules in RT are deleted before copying the content of R into RT. Further, the algorithm adopts the coding and crossover techniques illustrated in Sections 2.4.2 and 2.4.3, respectively, and terminates after a specific number of generations. The goal of the algorithm is to replace the worst rule with
the better new rule of the same class as the worst one, in order to improve prediction accuracy. Clearly, rules of minority classes will not be removed from the refined rule set. A brief sketch of the algorithm is outlined below.

Assumptions:
• Input examples are discretized.
• Rules are of uniform length and take the form [1 * 2 1 * 3 0] (the last value is the class attribute's value; '*' is treated as a don't-care value).

Variables:
Max_itr: maximum number of iterations (generations) // a predefined number
no_itr: number of iterations (generations)
RT: rule set

Input: R (rule set generated by C4.5), Etrain (training data set), Max_itr
// f(ri) denotes the fitness score of rule ri computed on Etrain using Eq. (1)
// F(R) denotes the overall fitness of rule set R computed on Etrain using Eq. (2)

begin
  no_itr ← 0 // initially zero iterations (generations)
  RT ← NULL
  Step 1: Randomly select two parents P1 and P2 from R, and place them into a mating pool.
  Step 2: Apply the suggested two-point crossover operation on the selected parents P1 and P2 to generate two new offspring, say O1 and O2; no_itr ← no_itr + 1.
  Step 3: for each existing rule ri ∈ R do the following:
    /* This segment attempts to find a sub/super rule of O1 in R */
    Step 3.1: If the class values of O1 and ri are the same then
      Step 3.1.1: Find the number (m) of pre-conditions matched between O1 and ri.
        If m = min(m1, m2) then /* min(m1, m2) returns the minimum of m1 and m2, where m1 and m2 are respectively the number of exact preconditions present in O1 and ri */
          If m1 < m2 then copy O1 in place of ri, else discard O1 and go to Step 5.
    /* The next part attempts to find a conflicting rule of O1 (if any) in R, and in that case not to include O1 */
    Step 3.2: else // i.e., the class values are not the same
      Step 3.2.1: If the antecedent part (i.e., LHS) of O1 is identical to that of ri, then discard O1 and go to Step 5.
  end for
  // This segment attempts to place O1 in place of some other distinct rule of the same class in R, if possible
  Step 4: Find the lowest fitness-valued rule rw in R having the same class as O1.
    If f(O1) < f(rw) then go to Step 5;
    else compute F(RT) on Etrain by copying the current content of R into RT and replacing rw with offspring O1.
    If F(RT) > F(R) then copy O1 in place of rw in R, else ignore O1.
  Step 5: for each rule ri ∈ R do the following:
    /* This part attempts to find a sub/super rule of O2 in R */
    Step 5.1: If the class values of O2 and ri are the same then
      Step 5.1.1: Find the number (m) of pre-conditions matched between O2 and ri.
        If m = min(m1, m2) then /* where m1 and m2 are respectively the number of exact preconditions present in O2 and ri */
          If m1 < m2 then copy O2 in place of ri, else discard O2 and go to Step 7.
    /* The next part attempts to find a conflicting rule of O2 (if any) in R, and in that case not to include O2 */
    Step 5.2: else // i.e., the class values are not the same
      Step 5.2.1: If the antecedent part (i.e., LHS) of O2 is identical to that of ri, then discard O2 and go to Step 7.
  end for
  // This segment attempts to place O2 in place of some other distinct rule of the same class in R, if possible
  Step 6: Find the lowest fitness-valued rule rw in R having the same class as O2.
    If f(O2) < f(rw) then go to Step 7;
    else compute F(RT) on Etrain by copying the current content of R into RT and replacing rw with offspring O2.
    If F(RT) > F(R) then copy O2 in place of rw in R, else ignore O2.
  Step 7: If the desired number of iterations (generations) is not completed (i.e., no_itr < Max_itr), then delete the parents from the mating pool and go to Step 1.
end

Output: optimized version of the rule set R (in discretized form).
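The following Python sketch (an illustrative reconstruction under our own assumptions, not the authors' implementation) puts the two-point crossover of Section 2.4.3.1 and the replacement test of Steps 4 and 6 together; the sub/super-rule and conflict checks of Steps 3 and 5 are omitted for brevity. `fitness_fn` plays the role of Eq. (1) and `overall_fn` the role of F(R) from Eq. (2); rules are fixed-length lists whose last element is the class.

```python
import random

def two_point_crossover(p1, p2, rng=random):
    """Two-point crossover on fixed-length rules (Section 2.4.3.1): two distinct cut
    points strictly inside the rule are chosen, and the heads and tails of the two
    parents are swapped while the middle segments stay in place. Assumes len >= 3."""
    L = len(p1)
    x1, x2 = sorted(rng.sample(range(1, L), 2))
    o1 = p2[:x1] + p1[x1:x2] + p2[x2:]
    o2 = p1[:x1] + p2[x1:x2] + p1[x2:]
    return o1, o2

def try_replace_worst(R, offspring, fitness_fn, overall_fn):
    """Steps 4/6: find the lowest-fitness rule of the offspring's class; replace it
    only if the offspring is at least as fit and the overall training accuracy of
    the tentative rule set R_T improves. Returns True if the replacement happened."""
    same_class = [i for i, r in enumerate(R) if r[-1] == offspring[-1]]
    if not same_class:
        return False
    w = min(same_class, key=lambda i: fitness_fn(R[i]))   # index of the worst rule r_w
    if fitness_fn(offspring) < fitness_fn(R[w]):
        return False
    RT = list(R)
    RT[w] = offspring                                     # tentative rule set R_T
    if overall_fn(RT) > overall_fn(R):
        R[w] = offspring                                  # accept the replacement
        return True
    return False

# One generation (Steps 1, 2, 4 and 6), given fitness functions f and F:
#   o1, o2 = two_point_crossover(*random.sample(R, 2))
#   try_replace_worst(R, o1, f, F); try_replace_worst(R, o2, f, F)
```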
Now, to identify distinct or identical rules, we consider only the preconditions of a rule with numerical values (i.e., other than '*'). For clarity, some examples are given below. We first take the case of distinct rules. For instance, each of the rules r1: 4 2 * 1 c1 (class) and r2: 4 2 * 2 c1 (class) has 3 preconditions, since there are 3 preconditions with numeric values in each of r1 and r2. However, r1 and r2 are not identical to each other, because the fourth (from the left) precondition values of r1 and r2 are not the same (they are 1 and 2, respectively), although the first two preconditions of each match exactly. Obviously, the number of preconditions matched between r1 and r2 is here 2, which is not equal to min(3,3) = 3. Hence, r1 and r2 are distinct. Next consider the case of identical rules. The two rules r1: 4 2 * 1 1 (class) and r2: 4 * * 1 1 (class) may be treated as identical because the numbers of preconditions present in r1 and r2 are 3 and 2, respectively, and they match at two places. Clearly, the number of matched preconditions is here 2, and min(3,2) also returns 2, which equals that number. Hence, the two rules are identical, but one may supersede the other; in other words, of these two rules, one is the super rule of the other. Obviously, the number of preconditions of r2 is 2, which is less than that of r1 (i.e., 3). So r2 is here treated as the super rule of r1, and r2 (instead of r1) is preferred in the rule set, with the aim of classifying more test examples. Further, the overall fitness F(R) is basically the overall classification accuracy of the rule set R on the training examples, and it is defined as follows:

F(R) = (Number of training examples correctly classified by the rule set R / Total number of training examples) × 100     (2)

Actually, during the genetic evolution of rules, it may also happen that a new rule covers examples which are already classified by an existing rule other than the worst rule of the same class as the new one. In such a situation, the fitness score of the new rule may be higher than that of the worst one. If this is the case, then the overall classification accuracy of the rule set may decrease if the new rule takes the place of the worst one. With this point in mind, this function is introduced so as not to include such a new redundant rule, at any cost, in place of the existing worst rule. From the time-complexity point of view, each of Steps 3–7 of the algorithm takes O(n) running time in the worst case for each generation, assuming n rules are present in the rule set R.

3. Other classifiers

To compare the strength of the proposed system DTGA, we have used here three other competent learners (each belonging to a
distinct family) namely, Neural Network (Artificial Neural Network), Naïve Bayes (Bayesian Network) and Rule-based classifier on rough set theory apart from the base classifier C4.5 (Decision Tree). A brief study on each of these is given below.
3.1. Neural network

Neural networks (NNs) are often referred to as artificial neural networks to distinguish them from biological neural networks. A neural network can be viewed as a directed graph with source (input), sink (output) and internal (hidden) nodes. The input nodes form an input layer and the output nodes an output layer, while the hidden nodes occur in one or more hidden layers. Solving a classification problem using NNs involves the following basic steps:

• Determining the number of input nodes, output nodes and hidden layers.
• Assigning weights (labels) and activation functions to be used in the graph.
• For each tuple in the training set, propagating it through the network and comparing the output prediction with the actual result. If the prediction is accurate, the weights are adjusted to ensure that this prediction has a higher output weight; if the prediction is not correct, the weights are adjusted to provide a lower output value for this class.

It is observed that, by learning from previous experience, a neural network acts as an excellent tool for prediction. However, a disadvantage is that it is difficult to design neural network architectures (mainly the activation functions and trained parameters), and the learned knowledge is not shown explicitly.
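As a small, hedged illustration of the layer sizing used in this study (n input neurons, (n + k)/2 hidden sigmoid neurons, k output neurons), the numpy sketch below shows a forward pass for such a network; it is our own sketch, not the WEKA configuration actually used, and the function names are hypothetical.

```python
import numpy as np

def build_mlp(n_inputs, n_classes, seed=0):
    """Create weight matrices for a single-hidden-layer network with
    (n_inputs + n_classes)//2 hidden units, as described in the text."""
    rng = np.random.default_rng(seed)
    n_hidden = (n_inputs + n_classes) // 2
    W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
    return W1, W2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward pass: sigmoid units in both the hidden and the output layer."""
    h = sigmoid(x @ W1)
    return sigmoid(h @ W2)     # one output per class; the largest is the prediction

# e.g. for the Iris data set: n = 4 inputs, k = 3 classes, hence 3 hidden neurons.
W1, W2 = build_mlp(4, 3)
print(forward(np.zeros(4), W1, W2).shape)   # (3,)
```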
3.1.1. The Naive Bayesian classifier

The Naive Bayes classification technique is based on the Bayes theorem and is particularly suited to cases where the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods. For classification, the approach calculates the probabilities of the different classes given some observed evidence. It is called naïve because it assumes that all attributes are independent of each other. A brief description of this method is given below. Given a data value xi, the probability that a related tuple (i.e., example) ti is in class Cj is described by the probability p(Cj | xi). Training data can be used to determine p(xi), p(xi | Cj) and p(Cj); all these are prior probabilities based on previous experience. From these known probabilities, the Bayes theorem allows us to estimate the posterior probability p(Cj | xi) and then p(Cj | ti). A form of the Bayes theorem is

p(X | Y) = p(Y | X) p(X) / p(Y)

where p(X) is the prior probability and p(X | Y) is the posterior probability. Given a training set, the Naïve Bayes algorithm first estimates the prior probability p(Cj) for each class by counting the total number of examples of each class in the training set. Again, for each attribute xi, the number of occurrences of each attribute value xi can be counted to determine p(xi). Similarly, the probability p(xi | Cj) can be estimated by counting how often each value occurs in the class in the training data set. It is true that a tuple in the training data may have many different attributes, each with many values. Suppose that ti (a new tuple) has q independent attribute values {xi1, xi2, . . ., xiq}. From the descriptive phase, we know p(xiq | Cj) for
each class Cj and attribute xiq. We then estimate p(ti | Cj) as follows:

p(ti | Cj) = ∏_{k=1}^{q} p(xik | Cj)
Now, using the known probabilities to calculate p(ti), we can estimate the likelihood that ti is in each class. The posterior probability p(Cj | ti) is then found for each class, and the class with the highest probability is the one chosen for the tuple. Precisely, the Bayesian classifier is easy to use and understand, and it can easily handle missing values by simply omitting the corresponding probability when calculating the likelihood of membership in each class. But it may not always give satisfactory results, as the attributes are usually not independent and the technique does not handle continuous data. Dividing the continuous values into ranges can be used to address this problem, but dividing the domain into ranges is not an easy task.

3.2. Rule-based classifier and rough set approach

The basic principle of any rule-based classifier is to generate IF (conditions) THEN (decision class) rules, where the conditions of each rule are simply conjunctions of elementary tests on the values of attributes and the decision part indicates the assignment of an object (that satisfies the condition part). The aim is to cover all the training cases, which is why these techniques are sometimes called covering techniques. In this context, we may point out that a decision tree based rule induction algorithm can always be used to generate rules of such a form, but the rules induced by a rule-based classifier may not necessarily build a tree, because the nodes in a tree are ordered whereas the rules generated by a rule-based classifier have no order. Some pure rule-based examples are CN2 [45] and CL2 [46]. The rough set (RS) approach (derived from set theory) also has the capability to generate decision rules in IF-THEN form. This theory was first introduced by the Polish computer scientist Pawlak [47,48] during the early 1980s. In this theory, data are collected in a table called a decision table, where the rows correspond to objects and the columns to features. Usually, decision tables are difficult to analyze, since they store a huge quantity of data, which is hard to manage from a computational point of view; moreover, some facts in the table may not be consistent with each other. So, it is necessary to reduce the size of the data, and one of the main objectives of RS data analysis is to do so. Interestingly, this approach has the ability to remove redundant or inconsistent information from the data table and to find minimal sets of attributes called reducts (i.e., to reduce dimensionality). A rough set classification model can be simply partitioned into three distinct phases: (i) Pre-processing phase: this phase includes tasks such as extra variable addition and computation, decision class assignment, data cleansing, completeness, correctness, attribute creation, attribute selection and discretization. (ii) Analysis and rule generating phase: in this phase, the generation of preliminary knowledge, such as the computation of object reducts from data, the derivation of rules from reducts, and rule evaluation and prediction processes, is taken into account. (iii) Classification phase: this phase utilizes the rules generated in the previous phase to predict the unseen data. Obviously, such an approach is easy to understand and offers a straightforward interpretation of the obtained results. On the basis of a case study of rough set theory, we notice that, at the beginning, rough set theory was mainly used to preprocess data and to classify objects. Therefore, its community has
concentrated on constructing efficient algorithms for extracting rules. But recently it has often been used within classification algorithms, i.e., combining the merits of rough sets with those of other classifiers. Several efficient methods for creating classifiers have been introduced; for a review see, e.g., [11,15]. These classifiers are often constructed using a search strategy that optimizes criteria strongly related to predictive performance (which is not directly present in the original rough set theory formulation). Several authors have developed their own approaches to construct decision rules from rough approximations of decision classes, which, joined together with classification strategies, led to good classifiers; see, e.g., [7,22]. In a nutshell, the complexity of a classifier can be reduced and its classification performance improved using rough set theory, but the problem of finding a minimum-length reduct is NP-hard, so heuristics are used in searching for short reducts.

4. Experimental design and results

4.1. Experimental study

This subsection describes the details of the experiments, including the data sets, the algorithms used for comparison and the settings of the tests. In particular, three sets of experiments are conducted in this study, namely, an experiment on the proposed sampling technique, an experiment on the performance of the proposed learning classifier and, finally, an experiment on the sensitivity analysis of the learners. Five learning algorithms in total are used in our investigation. Note that C4.5 [18] is downloaded software, whereas the presented GA-based algorithm is implemented in Java 1.4.1 on a Pentium 4 running Mandrake Linux OS 9.1. Regarding the other learners, the Naïve Bayes and Neural Network implementations are from the WEKA (Waikato Environment for Knowledge Analysis) 3.4.2 framework, whereas the rule-based classifier on rough sets is from RSES (Rough Set Exploration System) 2.2. All experiments are performed on the same machine. In the context of DTGA, parents and crossover sites are selected randomly with a different random seed each time; the suggested crossover is two-point. Also, the predefined number of generations for the GA is chosen as 80. The algorithms are tested on 18 benchmark data sets of real-world problems drawn from the UCI machine learning repository. Table 1 gathers the relevant features of the problems, arranged in alphabetical order of their names. More importantly, the last three columns of this table are chosen to show the class imbalance of the data sets. The imbalance ratio of each data set is computed using the formula proposed by Tanwani and Farooq [42] and shown in the 5th column of Table 1. It is restated here as Eq. (3):

Imbalance ratio (Ir) = ((Nc − 1)/Nc) Σ_{i=1}^{Nc} Ii/(In − Ii)     (3)

where Ii denotes the number of instances of the ith class; In, the total number of instances; and Nc, the number of classes present in the data set. The value of Ir lies in the range 1 ≤ Ir < ∞, and Ir = 1 implies that the data set is completely balanced, having equal numbers of instances of all classes (a small computational sketch of Eq. (3) is given after Table 1 below). Let us note here that the SPID4.7 discretizer converts each original data set into discrete form and handles missing values suitably. Also, it slightly modifies the number of instances and relevant attributes whenever required. For example, the original Annealing data set has 38 non-categorical attributes in total, but SPID4.7 has reduced this number to 20, since 18 attributes in the original data set contain only missing values throughout the set. Similarly, the 5th attribute in the Heart (Swiss) data remains constant for all instances and is therefore deleted by SPID4.7.
Table 1
Summary of the UCI data sets.

Problem name      | Number of non-target attributes | Number of classes | Number of examples | Imbalance ratio | % of minority class with minimum instances | % of majority class with majority instances
Adult             | 14 | 2  | 48,842 | 1.7468 | 23.91 | 76.07
Annealing         | 38 | 6  | 798    | 2.7679 | 1.00  | 76.00
Credit            | 15 | 2  | 690    | 1.0246 | 44.49 | 55.50
Dermatology       | 34 | 6  | 366    | 1.0526 | 5.4   | 30.60
Ecoli             | 8  | 8  | 336    | 1.2495 | 0.5   | 42.55
Glass             | 10 | 7  | 214    | 1.1571 | 4.2   | 35.51
Heart (Hungarian) | 13 | 5  | 294    | 1.7389 | 5.1   | 63.94
Heart (Swiss)     | 13 | 5  | 123    | 1.1409 | 4.06  | 39.02
Iris              | 4  | 3  | 150    | 1.0000 | 33.33 | 33.33
Liver disorder    | 6  | 2  | 345    | 1.0522 | 42.02 | 57.98
New-thyroid       | 5  | 3  | 215    | 1.7673 | 13.95 | 69.76
Pima Indian       | 8  | 2  | 768    | 1.2008 | 34.89 | 65.11
Segment           | 19 | 7  | 2310   | 1.0000 | 14.28 | 14.28
Soyabean-large    | 35 | 19 | 307    | 1.0402 | 0.32  | 13.02
Tae               | 5  | 3  | 151    | 1.0005 | 33.33 | 33.33
Wine              | 13 | 3  | 178    | 1.0191 | 26.96 | 39.88
Wisconsin         | 10 | 2  | 699    | 1.2133 | 34.47 | 65.52
Yeast             | 9  | 10 | 1486   | 1.1803 | 0.33  | 31.19
Note: For more details see Appendix A.
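As referenced after Eq. (3), the imbalance ratio can be computed as in the short Python sketch below. It is our own illustration (not code from the paper), reusing the 3-class example of Section 2.1 rather than the actual class counts of Appendix A.

```python
def imbalance_ratio(class_counts):
    """Imbalance ratio Ir of Eq. (3): ((Nc - 1)/Nc) * sum_i Ii / (In - Ii).
    Ir = 1 for a perfectly balanced data set and grows with the imbalance."""
    n_c = len(class_counts)
    i_n = sum(class_counts)
    return (n_c - 1) / n_c * sum(i_i / (i_n - i_i) for i_i in class_counts)

# Perfectly balanced: an Iris-like 3 x 50 instances -> Ir = 1.0
print(round(imbalance_ratio([50, 50, 50]), 4))    # 1.0
# The 3-class example of Section 2.1 (30/45/75 instances) -> Ir ≈ 1.119
print(round(imbalance_ratio([30, 45, 75]), 4))    # 1.119
```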
Interestingly, one may refer to Appendix A for more details of the selected data sets, including their class distributions. From Table 1, it is quite obvious that the selected data sets come from different domains. Again, they vary greatly in terms of number of classes, number of features and number of instances: the number of classes ranges up to 19, the number of features ranges from 4 to 38 and the number of instances ranges from 123 to 48,842. The imbalance ratio varies from 1 to 2.7679. Further, looking at the imbalance ratios of the data sets as well as their proportions of minority classes with minimum instances (%) and majority classes with maximum instances (%), we may comment that the selected problems are a mix of balanced and imbalanced problems, although some problems, namely Adult, Annealing, Ecoli, Heart (Hungarian), Heart (Swiss), New-thyroid and Pima Indian, are comparatively imbalanced. Note that, as a case study for evaluating the proposed GA on volumetric data, we have used the Adult data set, which is one of the largest public-domain data sets in the well-known UCI data repository. The details of the data set are described in Appendix A.

4.2. Experiment on the proposed sampling technique

To verify the strength of our proposed sampling technique, we have taken 9 data sets out of the selected 18 data sets (shown in Table 1). These are listed separately in Table 2. Note that, of the nine problems, the first seven are comparatively imbalanced and the last two are perfectly balanced. Now, we conduct three sub-experiments using the C4.5 learner, namely e1, e2 and e3, on these nine data sets, where each of e1, e2 and e3 consists of 10 runs.

4.3. Experiment e1 on random (30–70%) sampling

At each run of e1, we randomly pick 30% of the examples of each data set as the training set and the remainder as the test set, without considering the class distribution. Then C4.5 is run on this training set and the trained model is tested on the corresponding test set. The accuracy result of each run is recorded.

4.4. Experiment e2 on equi-distributed (30–70%) sampling

On the other hand, at each run of e2, we follow our suggested splitting approach (i.e., 30–70%, discussed in Section 2.1) to separate the training and test sets. Next, C4.5 is trained on the training
set and the induced knowledge is tested on the test set. Here too, the accuracy result at each run is recorded.

4.5. Experiment e3 on equi-distributed (40-60%) sampling

Of further interest, we follow our suggested splitting approach to select 40% of the examples of each class over the entire data set of each problem as the training set, and the remaining 60% as the test set. Then we apply the C4.5 learner as usual on the training set and compute accuracy on the respective test set. Let us remind here that accuracy (acc) is always calculated in this study using Eq. (4). Finally, the 10 results for each data set found from each sub-experiment are averaged and shown in Table 2. In addition, a standard deviation (s.d.) is calculated from the 10 results and displayed in the table.

4.6. Experiment for the performance analysis of the proposed learning classifier

To obtain a good estimate, all the classifiers are run 20 times, each time on a distinct training set and test set of each of the 18 data sets. Note that each training set and test set for each data set is selected by our proposed data-splitting strategy discussed in Section 2.1. Next, each classifier is trained on the training set, and then the trained model is run on the test set to measure accuracy. The training and test sets decided for a data set at a particular run are used by all the classifiers for that run only. In other words, at every run, two distinct sets (one training set and one test set) are first decided for each data set following the suggested data-sampling approach, and then each classifier is trained and the induced knowledge is tested. Note that 80 generations are produced in each run of the GA. The accuracies achieved on each data set by the individual classifiers are averaged over all 20 results. In addition, a standard deviation is reported along with each mean result; the standard deviation is important, since it summarizes the variability of a classifier's performance. Table 3 summarizes the performance of the suggested learners on the discretized data sets. For measuring the performance of a classifier at each run, simply the classification accuracy of the rule set is considered here. It is defined as

Accuracy = (number of test examples correctly classified by the rule set / total number of test examples) × 100    (4)
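The equi-distributed sampling of Section 2.1 and the accuracy measure of Eq. (4) can be summarized with the following Python sketch. It is an illustration only (the function names are ours, not part of the published implementation), and it assumes that the per-class training fraction is rounded up with the ceiling operator, as noted in Appendix D.

import math
import random

def equi_distributed_split(examples, labels, train_fraction=0.3, seed=None):
    # Draw (roughly) the same fraction of every class into the training set;
    # the ceiling operator is applied per class, as noted in Appendix D.
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        k = math.ceil(train_fraction * len(indices))
        train_idx.extend(indices[:k])
        test_idx.extend(indices[k:])
    train = ([examples[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([examples[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

def accuracy(predicted, actual):
    # Classification accuracy of Eq. (4), expressed in percent.
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)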
In addition, to understand the main difference between the replaced rules and the original rules, an example with an original rule set (learned by C4.5) as well as the optimized rule set (i.e., refined by DTGA) from one run is included in Appendix C.
Table 2. Comparative performance of the proposed and the usual data-splitting techniques on the UCI data sets: accuracy rate (%) of the following approaches (acc ± s.d.).

Problem name | Random (30-70%) data sampling and C4.5 | Equi-distributed (30-70%) data sampling and C4.5 | Equi-distributed (40-60%) data sampling and C4.5
Adult | 77.08 ± 2.35 | 82.21 ± 1.46 | 83.45 ± 1.21
Annealing | 88.15 ± 3.92 | 92.02 ± 2.52 | 91.83 ± 2.21
Ecoli | 79.15 ± 5.92 | 84.12 ± 4.64 | 79.43 ± 5.49
Heart (Hungarian) | 63.16 ± 6.01 | 69.28 ± 5.26 | 71.28 ± 4.34
Heart (Swiss) | 40.25 ± 7.04 | 44.38 ± 6.46 | 44.21 ± 6.16
New-thyroid | 90.11 ± 3.77 | 94.56 ± 3.26 | 93.12 ± 3.52
Pima Indian | 73.10 ± 3.08 | 78.08 ± 1.98 | 79.32 ± 1.61
Iris | 93.02 ± 3.76 | 95.78 ± 3.35 | 97.25 ± 3.04
Tae | 52.02 ± 5.65 | 55.86 ± 5.02 | 57.35 ± 4.82

4.7. Sensitivity analysis

In real-world data sets there are many data-quality problems that affect the prediction of a classifier, and missing data is quite a common one. However, the capability of handling such problems varies from classifier to classifier. In this respect, sensitivity analysis helps to determine the dependency of the model on the structure and hypothesis of the environment: if a tiny change of an input leads to great changes of the output, the model is said to be highly sensitive to that input. In this view, another experiment is carried out on all the suggested learners as follows. First, we separately select a training set (labelled E_train) and a test set (labelled E_test) for each problem, in addition to the 20th iteration of the preceding experiment, following the suggested data sampling. Then the knowledge induced by each learner on E_train is tested on E_test, and the overall accuracy is recorded. Next, for the purpose of checking the sensitivity of the learners, we follow the idea proposed by Lei et al. [43], who analyzed the changes in classification accuracy as the proportion of missing data in the data set is varied. Based on their observation, a classification algorithm is sensitive to missing data if the change is not considerably small. Statistically, if the proportion of missing values in the data set exceeds 20%, then there
is an obvious decrease in the classification accuracy. That is why we have intentionally introduced missing values (such as '?') into 25% of the examples of the test set E_test of each problem. Then, each of the suggested classifiers is simply run on the corrupted test set to measure the overall classification accuracy (t2) again. Finally, in Table 4, the decrement (Δ = t1 − t2) of the accuracy achieved by each classifier on each problem is shown within parentheses along with the overall accuracy (t1) obtained from the uncorrupted test set E_test.

4.8. Results and analysis

In this section, we provide the results of the experiments and discuss the performance of the sampling approach and of the learning algorithms over the data sets. The approach is implemented in Java 1.4.1. The results of Table 2 show that the proposed equi-distributed (30-70%) sampling yields better learning performance than random (30-70%) sampling on every data set, and that it performs comparably to or better than equi-distributed (40-60%) sampling, particularly when the data sets are imbalanced. In addition, Appendix D is included to provide further evidence of its strength. Based on the results of Table 2 and the evidence shown in Appendix D, we may reasonably claim that the suggested sampling technique is a useful pre-requisite for the learning model, although we cannot ensure that it is the optimal solution in this respect.
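The corruption step described in Section 4.7 can be sketched in Python as follows. How the '?' marks are distributed within a selected example is our assumption (the text only states that 25% of the test examples are affected), and model.predict in the trailing comment is a placeholder for whichever trained classifier is being evaluated.

import copy
import random

def corrupt_with_missing(test_examples, fraction=0.25, seed=0):
    # Return a copy of the test set in which 'fraction' of the examples carry
    # a missing value '?'.  Placement of the '?' is an assumption of this
    # sketch: one randomly chosen attribute of each selected example is blanked.
    rng = random.Random(seed)
    corrupted = copy.deepcopy(test_examples)
    n_corrupt = int(round(fraction * len(corrupted)))
    for idx in rng.sample(range(len(corrupted)), n_corrupt):
        attr = rng.randrange(len(corrupted[idx]))
        corrupted[idx][attr] = '?'
    return corrupted

# Sensitivity as reported in Table 4 (model.predict is a placeholder):
#   t1 = accuracy(model.predict(e_test), y_test)
#   t2 = accuracy(model.predict(corrupt_with_missing(e_test)), y_test)
#   delta = t1 - t2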
Table 3. Comparative performances of NN, Naïve Bayes, C4.5, DTGA and the rule-based classifier using rough set on the UCI data sets: accuracy rate (%) of the following approaches (acc = average accuracy, s.d. = standard deviation).

Problem name | NN (acc ± s.d.) | Naïve Bayes (acc ± s.d.) | C4.5 (acc ± s.d.) | DTGA (C4.5 + GA) (acc ± s.d.) | Rule-based classifier using rough set (acc ± s.d.)
Adult | 73.23 ± 2.43 | 79.20 ± 2.07 | 81.28 ± 1.94 | 85.08 ± 1.04 | 82.26 ± 1.12
Annealing | 92.17 ± 2.45 | 85.61 ± 3.89 | 92.46 ± 2.37 | 95.05 ± 1.94 | 90.84 ± 2.12
Credit | 82.91 ± 2.39 | 84.55 ± 2.05 | 82.15 ± 2.43 | 89.67 ± 1.43 | 83.27 ± 2.58
Dermatology | 94.65 ± 3.12 | 95.14 ± 2.89 | 94.78 ± 3.07 | 97.61 ± 2.45 | 96.12 ± 2.34
Ecoli | 85.65 ± 4.32 | 86.21 ± 3.82 | 85.72 ± 4.12 | 90.86 ± 3.46 | 82.78 ± 4.96
Glass | 69.94 ± 7.74 | 70.38 ± 7.08 | 72.85 ± 7.24 | 79.37 ± 4.59 | 71.87 ± 7.69
Heart (Hungarian) | 69.43 ± 6.14 | 73.03 ± 5.17 | 71.81 ± 5.81 | 78.08 ± 4.63 | 60.82 ± 6.34
Heart (Swiss) | 43.02 ± 5.12 | 40.73 ± 5.29 | 43.87 ± 6.92 | 52.82 ± 3.62 | 39.78 ± 7.12
Iris | 97.22 ± 2.38 | 96.17 ± 2.61 | 96.64 ± 3.21 | 98.02 ± 1.82 | 95.37 ± 2.87
Liver disorder | 72.71 ± 3.61 | 76.06 ± 2.97 | 74.86 ± 3.08 | 80.02 ± 1.85 | 76.07 ± 2.26
New-thyroid | 94.16 ± 3.38 | 95.02 ± 3.08 | 94.46 ± 3.07 | 97.82 ± 2.35 | 91.33 ± 3.65
Pima Indian | 77.01 ± 2.35 | 78.63 ± 2.05 | 77.28 ± 2.26 | 84.92 ± 1.37 | 79.81 ± 1.91
Segment | 95.13 ± 2.96 | 89.21 ± 4.38 | 94.47 ± 3.18 | 97.05 ± 2.56 | 94.56 ± 2.95
Soyabean-large | 86.34 ± 4.92 | 87.94 ± 4.56 | 87.89 ± 4.32 | 91.47 ± 2.58 | 86.31 ± 3.97
Tae | 57.02 ± 5.86 | 58.94 ± 4.22 | 57.97 ± 5.38 | 61.98 ± 3.14 | 52.13 ± 5.31
Wine | 97.41 ± 2.27 | 97.62 ± 2.38 | 97.12 ± 2.69 | 98.11 ± 1.37 | 96.78 ± 1.94
Wisconsin | 94.32 ± 2.38 | 95.06 ± 1.98 | 94.91 ± 2.17 | 98.07 ± 1.13 | 97.55 ± 1.22
Yeast | 51.34 ± 2.72 | 55.88 ± 3.87 | 53.98 ± 4.07 | 59.23 ± 1.97 | 47.25 ± 4.72
Table 4. Performance of the five classifiers on the corrupted test data sets [t1 = accuracy (%) over E_test (without missing data), t2 = accuracy (%) over corrupted E_test (i.e., 25% missing data in E_test), Δ = t1 − t2].

Problem name | NN (t1, Δ) | Naïve Bayes (t1, Δ) | C4.5 (t1, Δ) | DTGA (C4.5 + GA) (t1, Δ) | Rule-based classifier using rough set (t1, Δ)
Adult | 69.18 (4.06) | 78.34 (1.79) | 80.56 (2.06) | 84.09 (1.16) | 81.07 (2.86)
Annealing | 90.72 (4.63) | 83.40 (3.16) | 92.55 (2.44) | 95.21 (2.03) | 87.67 (3.05)
Credit | 83.87 (3.57) | 85.20 (2.37) | 82.14 (2.92) | 88.62 (1.97) | 83.20 (1.68)
Dermatology | 95.11 (0.16) | 96.41 (1.19) | 95.58 (1.67) | 97.38 (0.96) | 95.79 (1.02)
Ecoli | 85.14 (3.74) | 85.62 (3.34) | 85.31 (4.21) | 91.34 (3.78) | 80.73 (2.72)
Glass | 70.43 (4.34) | 70.67 (3.89) | 71.56 (5.82) | 79.21 (3.92) | 71.13 (3.43)
Heart (Hungarian) | 69.31 (0.96) | 71.82 (0.00) | 70.87 (2.44) | 77.21 (0.43) | 59.22 (0.89)
Heart (Swiss) | 42.76 (1.43) | 40.38 (1.37) | 43.96 (1.25) | 52.15 (0.00) | 40.09 (0.92)
Iris | 96.13 (2.06) | 94.08 (2.41) | 94.22 (2.71) | 97.08 (1.94) | 94.73 (2.18)
Liver disorder | 72.17 (2.59) | 75.12 (2.42) | 73.85 (3.03) | 79.83 (2.25) | 75.04 (2.82)
New-thyroid | 94.35 (3.97) | 96.02 (2.21) | 95.83 (1.46) | 97.44 (1.04) | 90.82 (2.38)
Pima Indian | 76.21 (2.32) | 78.13 (1.59) | 77.26 (2.30) | 83.76 (1.78) | 77.04 (1.94)
Segment | 94.12 (2.82) | 89.79 (1.37) | 93.77 (1.68) | 97.47 (1.19) | 94.73 (1.25)
Soyabean-large | 86.07 (1.46) | 86.92 (0.83) | 86.76 (2.65) | 91.61 (2.08) | 85.73 (1.13)
Tae | 56.92 (1.49) | 57.83 (2.76) | 56.12 (2.83) | 60.53 (2.97) | 49.13 (2.32)
Wine | 95.14 (1.42) | 96.17 (0.67) | 95.46 (1.22) | 97.71 (1.07) | 95.92 (1.04)
Wisconsin | 95.92 (1.10) | 96.11 (0.73) | 94.16 (0.89) | 96.74 (0.78) | 96.41 (0.72)
Yeast | 52.12 (2.03) | 55.87 (0.58) | 54.71 (2.62) | 60.13 (1.58) | 46.77 (2.49)
On the basis of the results of Tables 3 and 4, we summarize the strong points of our proposed system as follows:

• First of all, looking at Table 3, it is obvious that the DTGA classifier achieves better average accuracy than the other three competing learners (neural network, Naïve Bayes and the rough-set based rule inducer) over every kind of data set, irrespective of domain, size, dimensionality and imbalance. Again, the accuracy rate of DTGA is significantly higher than that of pure C4.5 on all 18 data sets. In other words, none of the selected learners improved classification accuracy on any data set as much as DTGA did.
• Secondly, with respect to the results on the more imbalanced data sets (shown in Table 3), we see that DTGA dominates its base classifier C4.5 heavily in almost all cases, and the other classifiers (neural network, Naïve Bayes and the rough-set based rule inducer) in many cases, such as Adult, Ecoli, Heart (Hungarian), Heart (Swiss) and Pima Indian. This reveals that DTGA is a good solution for imbalanced data sets too.
• Thirdly, considering the presence of the default rule in each of the rule sets of C4.5 and DTGA, one may easily realize from the results of Table 3 that the mis-classification rate (%) of DTGA on each data set is considerably reduced in comparison to pure C4.5, since the overall accuracy (%) achieved by DTGA is consistently higher than that of C4.5. This implies that DTGA has a high capability to remove noisy rules and, in particular, it demonstrates the strength of the suggested fitness function.
• Fourthly, as can be seen in Table 3, DTGA achieves a smaller standard deviation on nearly all of the selected problems compared with the other learners, which indicates greater reliability of its predictions.
• Again, analyzing the results of Table 4 for the sensitivity characteristics of the classifiers, Naïve Bayes is the least sensitive, whereas NN is the most sensitive to missing data among the five classifiers. DTGA, however, shows less sensitivity than C4.5, NN and the rule-based classifier (using rough set theory) in almost all cases. More importantly, the sensitivity of DTGA on most of the selected unbalanced data sets, such as Adult, Annealing, Heart (Swiss) and New-thyroid (as shown in Table 2), is much lower than that of the other learners.
• As stated above, rough set theory is a powerful tool for finding hidden patterns in data, for obtaining minimal sets of data (data reduction) and for generating sets of decision rules from data. It even has the
ability to generate rules from minority classes. But the results of Table 3 indicate that the rule-based classifier using rough set theory is also unable to compete with DTGA, although in many cases it shows better performance than pure C4.5, NN and Naïve Bayes.
• Lastly, comparing the performance of the learners on the larger Adult data set, we are hopeful that DTGA, based on the suggested sampling technique, is well suited to operating on volumetric data sets, whereas GA-based systems usually struggle to improve accuracy on larger data sets. Certainly, this is a promising feature in favour of DTGA.

Furthermore, of particular interest, we present below a short comparative study between DTGA and the n2-classifier on the empirical results of five data sets: Ecoli, Glass, Iris, Soybean-large and Yeast. The n2-classifier was first introduced by Jelonek et al. [10]. This kind of classifier is a specialized approach for solving multi-class learning problems; it is composed of (n2 − n)/2 base binary classifiers (where n is the number of decision classes, n > 2). The results on Ecoli, Glass, Iris, Soybean-large and Yeast obtained by the n2-classifier [22] are, respectively, 81.34 ± 1.7, 74.82 ± 1.4, 95.53 ± 1.2, 91.99 ± 0.8 and 55.74 ± 0.9, where each result shows the mean accuracy over 10 folds with its standard deviation. Consulting Table 3 for a direct comparison between DTGA and the n2-classifier on these data sets, we may easily see that the performance of DTGA is better than that of the n2-classifier, although the standard deviations found by the n2-classifier on these data sets are admittedly better than those of DTGA.

5. Conclusion and future work

In particular, the investigations introduced in [17,19] and in the present article are all associated with refining/optimizing rules learned by C4.5 by applying a GA. In short, all of them focus mainly on designing a novel evaluation function, variations on the encoding/decoding strategy and the crossover operation, to improve the overall classification accuracy irrespective of the domain, size and dimensionality of the data sets, in comparison to a variety of classifiers. Besides, data sampling is another important component of these approaches, reducing the bias towards the training set and, to a great extent, overcoming the imbalance problem occurring in data sets. Now, in the context of the present article, we summarize the main points as follows.
• Through experiments, it is observed that our proposed hybrid learning system DTGA shows better classification performance in most cases than the existing learners considered in this study. At the same time, the classification-error rate of DTGA on each data set is lower than that of its base learner C4.5.
• Again, the proposed system is comparatively less sensitive to missing data than C4.5 and NN, and fairly close to Naïve Bayes in this respect.
• To overcome the imbalance problem of a data set, we mainly concentrate on data sampling, in which much attention is paid to maintaining an almost equal proportion of the class distribution in the training set. Such a technique also helps to select an overall informative and balanced sample for classification at the very beginning of learning. In addition, a rule of a specific class replaces only the worst rule of the same class in the rule set, and this characteristic also helps to improve classification performance even if the data set is imbalanced.
• Moreover, the approach easily handles the interpretability problem that occurs in most learning classifier systems applying a GA.
• Finally, DTGA claims lower time complexity than most GA-based approaches, since no encoding and decoding technique is applied here and the best results are usually achieved within 80 generations.

In future, the proposed method can be parallelized to further improve the prediction accuracy of DTGA, especially over volumetric data sets, covering a large search space in order to find better-quality solutions. Further, decimal values of the rules can be encoded to maintain the diversity of the population for any kind of problem through different genetic operators in less time.

Acknowledgement

The authors are grateful to Ambuj Kumar, former student of the Department of Computer Science and Engineering, BIT, Mesra, for the implementation of the genetic algorithm proposed in this study.

Appendix A.

A.1. Data set description

All the datasets used in the experimental evaluation are available from the University of California, Irvine (UCI) Repository of Machine Learning Databases. Each data set (i.e., problem) is briefly described below. For more information on the Repository, consult http://www.ics.uci.edu/∼mlearn/MLRepository.html. However, all the datasets are first discretized by the SPID4.7 discretizer and then used in our experiment. Note: an expression of the form x(y) indicates that there are 'y' entries for class value/code/name 'x'; for example, 4(60) indicates that there are 60 instances of class value 4, i.e., it is used to show the class distribution.

A.1.1. Adult

The Adult data set (originally called the "Census Income" database of the United States Census Bureau) is used to predict, on the basis of census data, whether the income of an individual exceeds $50K/yr. The original database contains 48,842 observations of US citizens, each with 14 non-target attributes and one binary-valued target attribute (called income). In fact, the levels of income are considered as ≤50K (small) and >50K (large). However, due to memory limitations, two thirds of this volumetric data set (i.e., 32,561 out of 48,842 observations) is considered in our experiment. The class distribution of the reduced data set is shown below.
Class distribution: 0(24,720), 1(7841), where 0 represents ≤50K (small) and 1 represents >50K (large); the corresponding distribution of the original data set is 0(37,155), 1(11,687).

A.1.2. Annealing

This database has 798 instances with 38 non-categorical attributes in total and 6 different classes. However, 18 attributes (out of 38) have missing values throughout the database, and so these are deleted by the discretizer adopted in our experiment. Distribution of classes: 1(8), 2(88), 3(608), 4(0), 5(60), U(34).

A.1.3. Credit

This problem concerns credit card applications. It has 15 non-target attributes, one target attribute with 2 class values (+ and −), and 690 observations in total. Class distribution: +(307), −(383).

A.1.4. Dermatology

The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very little difference. The diseases in this group are psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, cronic dermatitis and pityriasis rubra pilaris. Usually a biopsy is necessary for the diagnosis, but unfortunately these diseases share many histopathological features as well. This database contains 366 examples and 34 non-categorical attributes (33 of which are linear-valued and one nominal). Class distribution: psoriasis (112), seboreic dermatitis (61), lichen planus (72), pityriasis rosea (49), cronic dermatitis (52), pityriasis rubra pilaris (20).

A.1.5. Ecoli

The data give characteristics of each ORF (potential gene) in the E. coli genome. Sequence, homology (similarity to other genes), structural information and function (if known) are provided in the database. The data set contains 8 non-target attributes and 336 observations with 8 classes. However, the 5th attribute takes the same value in almost all records of the database, and so it is deleted by our discretizer at the pre-processing phase. Class distribution: cp (cytoplasm) (143), im (inner membrane without signal sequence) (77), pp (perisplasm) (52), imU (inner membrane, uncleavable signal sequence) (35), om (outer membrane) (20), omL (outer membrane lipoprotein) (5), imL (inner membrane lipoprotein) (2), imS (inner membrane, cleavable signal sequence) (2).

A.1.6. Glass

This data set was donated by Vina Spiehler of Diagnostic Products Corporation. The study of classification of types of glass was motivated by criminological investigation: at the scene of a crime, the glass left behind can be used as evidence if it is correctly identified. Originally, the data set has 214 examples, each of which contains 10 non-target attributes and takes one of 7 class values. However, the first attribute is basically an id number (record number) and has no importance in the classification task, so it is simply deleted from our experiment. Again, no example of glass type 4 is found in the database, and so this class is also ignored in our experiment. Type of glass (class attribute): (1) building windows float processed (76), (2) building windows non float processed (70), (3) vehicle windows float processed (17), (4) vehicle windows non float processed (0), (5) containers (13), (6) tableware (9), (7) headlamps (29).
A.1.7. Heart (Hungarian and Swiss)

These data sets were donated by David W. Aha. The database contains information about heart disease collected from regions such as Cleveland, Hungary and Switzerland, although the number of available examples varies from region to region. Interestingly, out of around 76 attributes, most experiments use a subset of 14 relevant attributes (including the class attribute) available in processed form. The "goal" field refers to the presence of heart disease in the patient in five forms; it takes an integer value from 0 (absence of heart disease) to 4. In particular, the Heart (Hungarian) and Heart (Swiss) data sets contain 294 and 123 cases, respectively.

Class distribution | 0 | 1 | 2 | 3 | 4 | Total
Hungarian | 188 | 37 | 26 | 28 | 15 | 294
Switzerland | 8 | 48 | 32 | 30 | 5 | 123
A.1.8. Iris (iris plants)

This data set corresponds to the well-known Iris data set of R.A. Fisher. It contains 150 patterns in four dimensions, measured on iris flowers of 3 different species, where each class refers to a type of iris plant. Class distribution: Iris Setosa (50), Iris Versicolour (50), Iris Virginica (50).

A.1.9. Liver disorder

The liver disorder data set was donated by Richard S. Forsyth, from data collected by BUPA Medical Research Ltd. This problem has 345 instances with 6 non-target attributes. The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each line in the bupa.data file constitutes the record of a single male individual. The selector field is used to split the data into two sets: 1 and 2. Class distribution: 1(145), 2(200).

A.1.10. New-thyroid

This is a medical data set regarding the thyroid gland. It consists of 215 points from 3 classes, 1 (normal), 2 (hyper) and 3 (hypo), in a five-dimensional space. Class distribution: 1(150), 2(35), 3(30).

A.1.11. Pima Indian

The problem is to predict whether a patient would test positive for diabetes given a number of pathological measurements and medical test results. The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 h post-load plasma glucose was at least 200 mg/dl at any survey examination, or if diabetes was found during routine medical care). The patients in the database are females at least twenty-one years old of Pima Indian heritage, living near Phoenix, Arizona, USA. There are 2 classes, 8 numerical attributes and 768 records in the database. Class distribution (class value 1 is interpreted as "tested positive for diabetes", 0 as negative): 0(500), 1(268).

A.1.12. Segment

The problem consists of identifying an outdoor image. The instances were drawn randomly from a database of 7 outdoor images. The images were hand-segmented to create a classification for every pixel as one of brickface, sky, foliage, cement, window, path or grass. It was used in the Statlog project. There are 7 classes, 19 numerical attributes and 2310 records. Classes: brickface, sky, foliage, cement, window, path, grass, with 330 instances per class in the data file.
A.1.13. Soybean

There are 35 non-categorical attributes (some nominal and some ordered) and 19 classes in this problem. Usually, the first 15 classes (out of 19) have been used in prior work; the folklore seems to be that the last four classes are unjustified by the data since they have so few examples. The database has 307 instances. Class distribution: (1) diaporthe-stem-canker (10), (2) charcoal-rot (10), (3) rhizoctonia-root-rot (10), (4) phytophthora-rot (40), (5) brown-stem-rot (20), (6) powdery-mildew (10), (7) downy-mildew (10), (8) brown-spot (40), (9) bacterial-blight (10), (10) bacterial-pustule (10), (11) purple-seed-stain (10), (12) anthracnose (20), (13) phyllosticta-leaf-spot (10), (14) alternaria-leaf-spot (40), (15) frog-eye-leaf-spot (40), (16) diaporthe-pod-&-stem-blight (6), (17) cyst-nematode (6), (18) 2-4-d-injury (1), (19) herbicide-injury (4).
A.1.14. TAE (teaching assistant evaluation) The data set consists of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant (TA) assignments at the Statistics Department of the University of Wisconsin–Madison. The scores were divided into 3 roughly equal-sized categories (“low (1)”, “medium(2)”, and “high(3)”) to form the class variable. Class distribution: 3(52), 2(50), 1(49).
A.1.15. Wine The donor of this set is Stefan Aeberhard. However, the original owner is M. Forina et al., PARVUS an extendible package for Data Exploration, Classification and Correlation, Institute of Pharmaceutical and Food Analysis and Technologies, Brigata Salerno, 16147 Genoa, Italy. These data are, in fact, the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents(non-target attributes) found in each of the 3 types of wines. It contains 178 instances. Class distribution: class-1(59), class-2(71), class-3(48).
A.1.16. Wisconsin

This is the breast cancer (original) database available at UCI. It was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. The data contain various breast biopsy measurements collected by oncologists through fine-needle aspiration. The objective is to predict whether a tissue sample taken from a patient's breast is malignant or benign. There are 2 classes, 10 numerical attributes and 699 observations in the database. However, the first attribute in the data file denotes an id number (simply a sample code), and so it has little relevance to classification; therefore, it is ignored in our present experiment. Class distribution: benign (458), malignant (241).
A.1.17. Yeast

This data set gives statistics on protein localization, with 1484 instances in 10 classes, each instance consisting of 9 non-target attributes. However, the 7th attribute remains almost the same throughout the observations, and so it is ignored in many experiments, including ours. Class distribution: CYT (cytosolic or cytoskeletal) (463), NUC (nuclear) (429), MIT (mitochondrial) (244), ME3 (membrane protein, no N-terminal signal) (163), ME2 (membrane protein, uncleaved signal) (51), ME1 (membrane protein, cleaved signal) (44), EXC (extracellular) (37), VAC (vacuolar) (30), POX (peroxisomal) (20), ERL (endoplasmic reticulum lumen) (5).
Appendix B.

In this section, a tiny classification problem named golf-playing is presented to clarify several points discussed throughout the article. Truly speaking, this is one of the problems on which the Interface [26] software is run. It will help to give a general idea of a classification problem.
B.1. Problem Golf-playing.
B.2. Description

The problem depends mainly on the weather: it describes under what weather conditions the game of golf is suitable for playing. In general, instances in a data set are characterized by the values of features, or attributes, that measure different aspects of the instance. In this case, four non-target attributes, outlook, humidity, temp and windy, are taken into account to forecast whether playing golf is possible on a given day or not. For this purpose, observed values of each of these attributes over a number of days are necessary. A sample data set consisting of 14 different days' observations is given below, along with a description of the possible values of the attributes.
B.3. Non-target attributes with possible values Outlook: sunny, overcast, rain, Humidity: continuous, Temperature: continuous, Windy: true, false.
B.4. Target attribute

Playing-decision: Play, Don't. All these attributes are stored in a file, say golf.name.

B.4.1. Training examples from the original 'golf.data'

Day | Outlook | Humidity | Temp | Windy | Playing-decision
1 | sunny | 85 | 46 | false | Don't
2 | sunny | 88 | 45 | true | Don't
3 | overcast | 82 | 42 | false | Play
4 | rain | 94 | 25 | false | Play
5 | rain | 70 | 20 | false | Play
6 | rain | 65 | 15 | true | Don't
7 | overcast | 66 | 14 | true | Play
8 | sunny | 85 | 28 | false | Don't
9 | sunny | 70 | 15 | false | Play
10 | rain | 72 | 36 | false | Play
11 | sunny | 65 | 25 | true | Play
12 | overcast | 90 | 22 | true | Play
13 | overcast | 75 | 41 | false | Play
14 | rain | 80 | 21 | true | Don't
Note that the attributes of the data set contain both nominal and continuous values. Such attributes can be discretized to reduce the range of values in the case of continuous attributes and to normalize the nominal attributes (e.g., 1 for high, 2 for normal, etc.). Let us assume their values as follows:

Attribute | Assumed discrete values
Playing-decision | Play(1): yes; Don't(0): no
Outlook | sunny(1); overcast(2); rain(3)
Humidity | high(1): ≥ 75; normal(2): < 75
Temperature | hot(1): > 36; mild(2): (20, 36]; cool(3): ≤ 20
Windy | true(1); false(2)
B.4.2. Discretized 'golf.data'

As per the above-assumed ranges and their respective mapped values for the attributes, we get the following discretized data set corresponding to the original data set above.

Day | Outlook | Humidity | Temp | Windy | Playing-decision
1 | 1 | 1 | 1 | 2 | 0
2 | 1 | 1 | 1 | 1 | 0
3 | 2 | 1 | 1 | 2 | 1
4 | 3 | 1 | 2 | 2 | 1
5 | 3 | 2 | 3 | 2 | 1
6 | 3 | 2 | 3 | 1 | 0
7 | 2 | 2 | 3 | 1 | 1
8 | 1 | 1 | 2 | 2 | 0
9 | 1 | 2 | 3 | 2 | 1
10 | 3 | 2 | 2 | 2 | 1
11 | 1 | 2 | 2 | 1 | 1
12 | 2 | 1 | 2 | 1 | 1
13 | 2 | 2 | 1 | 2 | 1
14 | 3 | 1 | 2 | 1 | 0
Note that the role of the discretizer is to provide discretized values of the continuous and nominal attributes of each data set, following its own strategy (i.e., the range of values corresponding to each discrete value of a continuous attribute may vary from discretizer to discretizer). In the present experiment, the selected data sets are discretized by the SPID4.7 discretizer.
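A minimal Python sketch of such a threshold-based mapping is given below. It follows the assumed ranges stated above; the function name is ours, and boundary cases (e.g., humidity exactly 75) depend on whether the cut is taken as ≥ or >, so a few entries may differ slightly from the sample table.

def discretize_golf_record(outlook, humidity, temp, windy):
    # Map one raw golf-playing record to the discrete codes assumed above.
    outlook_code = {'sunny': 1, 'overcast': 2, 'rain': 3}[outlook]
    humidity_code = 1 if humidity >= 75 else 2            # high(1) / normal(2)
    if temp > 36:
        temp_code = 1                                      # hot
    elif temp > 20:
        temp_code = 2                                      # mild
    else:
        temp_code = 3                                      # cool
    windy_code = 1 if windy else 2                         # true(1) / false(2)
    return [outlook_code, humidity_code, temp_code, windy_code]

print(discretize_golf_record('sunny', 85, 46, False))      # day 1 -> [1, 1, 1, 2]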
B.4.3. Rules generated by C4.5

Five primary rules (in IF-THEN structure) induced by C4.5 are shown below. The conclusion of each rule shows the class as well as the information covered (measured in %), in the format: class value [% of training instances covered]. The classifier itself arranges the produced rules in descending order of information covered, grouped by class.

Rule 3: Outlook = 2 → class 1 [70.7%]. [This rule is realized as: IF (Outlook = 2) THEN (Playing-decision = 1).]
Rule 2: Humidity = 2 → class 1 [66.2%]. [This rule is realized as: IF (Humidity = 2) THEN (Playing-decision = 1).]
Rule 5: Outlook = 3 AND Windy = 2 → class 1 [63.0%]. [This rule is realized as: IF (Outlook = 3) AND (Windy = 2) THEN (Playing-decision = 1).]
Rule 1: Outlook = 1 AND Humidity = 1 → class 0 [63.0%]. [This rule is realized as: IF (Outlook = 1) AND (Humidity = 1) THEN (Playing-decision = 0).]
Rule 4: Outlook = 3 AND Windy = 1 → class 0 [50.0%]. [This rule is realized as: IF (Outlook = 3) AND (Windy = 1) THEN (Playing-decision = 0).]

[The target attribute is the playing-decision, and target classes '0' and '1' mean playing-decision 'No' and 'Yes', respectively.]

Obviously, the above rules can be represented as a disjunctive normal form (using the symbol ∨) of conjunctions (using the symbol ∧, i.e., logical AND) of constraints on the attribute values of the instances: Rule-1 ∨ Rule-2 ∨ Rule-3 ∨ Rule-4 ∨ Rule-5 (which is, in fact, a disjunctive normal form), where Rule-1 (one disjunct) is again a collection of conjuncts (i.e., pre-conditions) connected by the symbol ∧, such as (outlook = 1) ∧ (humidity = 1) → class 0, and so on.
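The conversion into the fixed-length, '*'-padded representation shown in Section B.4.4 below can be sketched as follows in Python; this is an illustration under our own naming, not the actual implementation of the Interface [26].

# Attribute order assumed for the golf problem: outlook, humidity, temp, windy.
ATTRIBUTES = ['outlook', 'humidity', 'temp', 'windy']

def to_fixed_length(conditions, class_value):
    # Convert one C4.5 rule, given as {attribute: value} conditions plus a
    # class, into the '*'-padded row used by the GA phase (cf. B.4.4 below).
    row = [conditions.get(name, '*') for name in ATTRIBUTES]
    return row + [class_value]

# Rule 3: IF (Outlook = 2) THEN (Playing-decision = 1)
print(to_fixed_length({'outlook': 2}, 1))                   # [2, '*', '*', '*', 1]
# Rule 1: IF (Outlook = 1) AND (Humidity = 1) THEN (Playing-decision = 0)
print(to_fixed_length({'outlook': 1, 'humidity': 1}, 0))    # [1, 1, '*', '*', 0]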
B.4.4. Formation of the discretized rule set by the Interface

The Interface forms a discretized rule set corresponding to the above rule set, eliminating the IF-THEN clause, as shown below:

Rule | Outlook | Humidity | Temp | Windy | Playing-decision
Rule 3 | 2 | * | * | * | 1
Rule 2 | * | 2 | * | * | 1
Rule 5 | 3 | * | * | 2 | 1
Rule 1 | 1 | 1 | * | * | 0
Rule 4 | 3 | * | * | 1 | 0
The symbol '*' in a rule denotes the don't-care symbol and means that the attribute corresponding to '*' has no importance in that rule. For example, in Rule 3, only one condition, involving the outlook attribute, is present; therefore, the conditions corresponding to the attributes humidity, temp and windy (following their positions in golf.data) are marked '*'. Further, the length of each of these rules is considered to be 5.
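Applying such wildcard rules to a discretized example can be sketched as follows in Python. The ordered rule list is taken from Section B.4.4 above, and the sequential matching with a default rule follows the strategy described in Appendix C; the choice of the majority class (1) as the default is an assumption of this sketch.

# Ordered rule list from B.4.4: (conditions, class), '*' = don't care.
# Attribute order: outlook, humidity, temp, windy.
GOLF_RULES = [
    ([2, '*', '*', '*'], 1),   # Rule 3
    (['*', 2, '*', '*'], 1),   # Rule 2
    ([3, '*', '*', 2], 1),     # Rule 5
    ([1, 1, '*', '*'], 0),     # Rule 1
    ([3, '*', '*', 1], 0),     # Rule 4
]

def classify(example, rules, default_class):
    # Sequential matching: try the rules in order; if none fires,
    # the default rule (majority class) decides.
    for conditions, label in rules:
        if all(c == '*' or c == v for c, v in zip(conditions, example)):
            return label
    return default_class

# Day 1 of the discretized golf data: outlook=1, humidity=1, temp=1, windy=2
print(classify([1, 1, 1, 2], GOLF_RULES, default_class=1))   # -> 0 (Don't play)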
Appendix C.

C.1. An illustration on the replacement of rules

Consider the Heart (Hungarian) data set. For this problem, most experiments use a subset of 14 attributes (13 non-categorical and 1 categorical). The categorical (goal) field refers to the presence of heart disease in the patient in 5 forms; these are integer values from 0 (absence of heart disease) to 4. In this experiment too, the 13 non-categorical attributes (denoted A1, A2, ..., A13) and the class attribute (denoted C) are chosen. The original rule set (learned by C4.5) and the replaced rule set (i.e., refined by the GA part of DTGA) corresponding to this problem are shown below in tabular form. Obviously, the rules induced by C4.5 are first passed to the Interface to eliminate the IF-THEN form (as discussed in Appendix B). Let us remind here again that an attribute condition consisting of '*' has no importance in the rule. The rules marked "(replaced)" in the second table are the rules that were replaced, relative to the original rule set, at a specific run; this is shown to give an idea of the replaced rules. The replaced set is, in fact, captured at a specific run (consisting of 80 generations) of DTGA, whereas the original rule set is the rule set of its base learner C4.5 at that particular run.
Original rule set (learned by C4.5)

Rule | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | C
1 | * | * | * | * | 2 | * | * | * | 2 | 5 | * | * | * | 3
2 | 1 | 1 | * | 1 | * | * | * | * | * | * | * | * | * | 1
3 | * | 2 | * | * | * | * | * | * | * | * | * | * | 2 | 1
4 | * | * | * | * | * | * | * | * | 2 | 6 | * | * | * | 4
5 | * | * | * | * | 1 | * | * | * | 2 | 5 | * | * | * | 2
6 | 6 | * | * | * | * | * | * | * | 2 | * | * | * | * | 2
7 | * | 2 | * | * | * | * | 3 | * | * | * | * | * | * | 1
8 | * | * | * | * | * | * | * | * | 1 | * | * | * | 3 | 0
9 (default) | * | * | * | * | * | * | * | * | * | * | * | * | * | 0

Rule set generated and replaced by GA

Rule | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | C
1 | * | * | * | * | 2 | * | * | * | 2 | 5 | * | * | * | 3
2 (replaced) | * | 2 | * | * | 1 | * | * | * | * | * | * | * | * | 1
3 (replaced) | * | * | * | * | * | * | * | * | * | * | * | * | 2 | 1
4 | * | * | * | * | * | * | * | * | 2 | 6 | * | * | * | 4
5 (replaced) | * | * | * | * | * | * | * | * | 1 | 5 | * | * | * | 2
6 | 6 | * | * | * | * | * | * | * | 2 | * | * | * | * | 2
7 | * | 2 | * | * | * | * | 3 | * | * | * | * | * | * | 1
8 | * | * | * | * | * | * | * | * | 1 | * | * | * | 3 | 0
9 (default) | * | * | * | * | * | * | * | * | * | * | * | * | * | 0
Note that the last rule in each of the above rule sets is the default rule (i.e., a rule with the majority class). It is originally generated by C4.5 for each data set and placed at the end of the rule set. It participates in measuring the overall accuracy of the rule set, but none of the operations adopted in our investigation (selection, crossover, replacement, etc.) is performed on it; in other words, it remains unchanged throughout our experiment. Let us remind here that matching of an input for computing accuracy is performed in a sequential manner, i.e., for a given input we first check rule 1; if it matches, fine, otherwise we go to rule 2, and so on. When all the rules fail, the default rule comes into action. This strategy is followed for each rule set (generated by C4.5 and by the hybrid system) corresponding to each data set. Clearly, owing to the presence of the default rule in each rule set, the error rate (%) of such a rule set on a data set equals the difference between 100 and the obtained accuracy (%) on that data set. New rules with new attribute values are generated by crossover, and they appear in the replaced rule set because they replace the worst rules in the existing rule set.
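The replacement policy just described can be sketched as follows in Python; it is only an illustration under our own naming. The accuracy-based nature of the fitness score is an assumption here (DTGA's actual fitness function is defined in the body of the paper), but the constraints shown, same-class replacement of the worst rule only and an untouched default rule, are those stated in the text.

def replace_worst_same_class(rule_set, candidate, fitness, default_index):
    # A candidate rule may only displace the worst existing rule of the SAME
    # class, and the default rule is never touched.  'fitness' is assumed to
    # be an accuracy-based score on the training set.
    same_class = [i for i, (conds, label) in enumerate(rule_set)
                  if label == candidate[1] and i != default_index]
    if not same_class:
        return rule_set                        # no rule of this class to replace
    worst = min(same_class, key=lambda i: fitness(rule_set[i]))
    if fitness(candidate) > fitness(rule_set[worst]):
        rule_set = list(rule_set)
        rule_set[worst] = candidate            # e.g. rules 2, 3 and 5 above
    return rule_set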
Appendix D.

Let us first analyze the proposed sampling technique, choosing different percentages of training data on the Heart (Hungarian) data set. The class distribution of the data set is as follows:

Class value | 0 | 1 | 2 | 3 | 4 | Total
Number of examples | 188 | 37 | 26 | 28 | 15 | 294

Case-I: 30% training data based on equal proportion in the class distribution. The number of examples in the training set in this case is as follows:

Class value | 0 | 1 | 2 | 3 | 4 | Total
Number of examples | 57 | 12 | 08 | 09 | 05 | 91
Ratio in the training set | 0.619 | 0.142 | 0.087 | 0.098 | 0.054 |

Case-II: 40% training data based on equal proportion in the class distribution. The number of examples in the training set in this case is as follows:

Class value | 0 | 1 | 2 | 3 | 4 | Total
Number of examples | 76 | 15 | 11 | 12 | 06 | 120
Ratio in the training set | 0.633 | 0.125 | 0.091 | 0.100 | 0.050 |

Case-III: 50% training data based on equal proportion in the class distribution. The number of examples in the training set in this case is as follows:

Class value | 0 | 1 | 2 | 3 | 4 | Total
Number of examples | 94 | 19 | 13 | 15 | 08 | 149
Ratio in the training set | 0.630 | 0.1275 | 0.087 | 0.100 | 0.0536 |

Case-IV: 60% training data based on equal proportion in the class distribution. The number of examples in the training set in this case is as follows:

Class value | 0 | 1 | 2 | 3 | 4 | Total
Number of examples | 113 | 23 | 16 | 17 | 09 | 178
Ratio in the training set | 0.634 | 0.129 | 0.089 | 0.095 | 0.050 |
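The per-class training counts in Cases I-IV above can be reproduced with a ceiling-based stratified draw, as the note that follows explains. A minimal Python sketch (function name ours) is given below; it reproduces the Case-I total of 91, and Cases II and IV follow similarly (Case-III differs by one example for class 3).

import math

heart_hungarian = {0: 188, 1: 37, 2: 26, 3: 28, 4: 15}   # distribution above

def per_class_training_counts(class_counts, train_fraction):
    # Ceiling operator applied to every class, as stated in the note below.
    return {c: math.ceil(train_fraction * n) for c, n in class_counts.items()}

counts = per_class_training_counts(heart_hungarian, 0.3)   # Case-I
print(counts, sum(counts.values()))
# {0: 57, 1: 12, 2: 8, 3: 9, 4: 5} 91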
Note that the ceiling operator is used here when including the examples of each class type. The above analysis shows that, as the percentage of training data increases, the ratios of the minority classes (especially in the case of an imbalanced data set) decrease while those of the majority classes increase.

The accuracies (%) obtained by C4.5 on the respective test sets at a run under the four cases, for three of the data sets, are given below; the value within parentheses indicates the number of rules generated by C4.5 over the respective training set.

Problem name | Equi-distributed (30-70%) sampling (Case-I) | Equi-distributed (40-60%) sampling (Case-II) | Equi-distributed (50-50%) sampling (Case-III) | Equi-distributed (60-40%) sampling (Case-IV)
Ecoli | 82.60 (17) | 78.80 (22) | 81.80 (25) | 82.95 (29)
Heart (Hungarian) | 68.40 (09) | 70.50 (12) | 72.40 (14) | 73.20 (17)
Heart (Swiss) | 43.40 (09) | 45.50 (12) | 38.40 (14) | 41.20 (17)

The numbers of rules generated by C4.5 from different training sets of the Heart (Swiss) data set at a particular run are recorded in the table given below, together with the accuracy results of the rule sets on their respective test sets. The rule sets in tabular form (after applying the Interface s/w) are likewise presented below.

Heart (Swiss) data set:
Property | Random 30% training data | Equi-distributed 30% training data | Random 66% training data
Number of C4.5 rules on the training data sets | 10 | 10 | 12
Accuracies of C4.5 rule sets on the respective test sets | 36.78% | 45.98% | 42.86%
Rule set generated by C4.5 over random 30% training data

Rule | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | C
1 | * | * | * | 3 | * | * | 3 | * | * | * | * | * | 2
2 | * | * | * | * | * | * | * | * | 4 | * | * | * | 2
3 | * | * | 2 | * | * | * | * | * | * | * | * | * | 0
4 | * | * | 4 | 2 | * | * | * | * | * | * | * | * | 1
5 | * | * | * | * | * | * | * | * | 3 | * | * | * | 1
6 | * | * | * | 5 | * | * | * | * | 2 | * | * | * | 1
7 | * | * | * | 3 | * | * | 1 | * | * | * | * | * | 3
8 | * | * | * | * | * | * | * | * | 5 | * | * | * | 3
9 | * | * | 4 | 4 | * | * | * | * | 2 | * | * | * | 3
10 | * | * | * | * | * | * | * | * | * | * | * | * | 1
Rule set generated by C4.5 over equally distributed 30% training data

Rule | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | C
1 | * | * | * | 2 | * | * | 3 | * | 2 | * | * | * | 1
2 | 3 | * | * | 4 | * | * | * | * | * | * | * | * | 2
3 | 2 | * | 4 | * | * | * | * | * | * | * | * | * | 2
4 | 3 | * | * | 5 | * | * | * | * | * | * | * | * | 2
5 | * | * | 3 | * | * | * | * | * | * | * | * | * | 0
6 | * | * | * | * | * | * | * | * | 5 | * | * | * | 3
7 | * | * | * | 2 | * | * | 2 | * | * | * | * | * | 3
8 | * | * | * | 3 | * | * | * | * | * | * | * | * | 3
9 | * | * | 4 | * | * | * | * | * | * | * | 3 | * | 4
10 | * | * | * | * | * | * | * | * | * | * | * | * | 1
Rule set generated by C4.5 over random 66% training data

Rule | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | C
1 | * | * | * | * | * | * | * | * | 1 | 2 | * | * | 1
2 | * | * | * | 2 | * | * | * | * | 2 | * | * | 3 | 1
3 | * | * | 4 | 5 | * | * | * | * | 2 | * | * | * | 1
4 | * | * | * | * | * | * | * | * | * | 3 | * | * | 1
5 | * | * | * | * | * | * | * | * | 3 | * | * | * | 1
6 | * | * | 3 | * | * | * | * | * | 5 | * | * | * | 1
7 | * | * | 3 | * | * | * | 2 | * | * | * | * | * | 1
8 | * | * | 4 | * | * | * | * | * | 5 | * | * | * | 3
9 | * | * | * | 3 | * | * | 1 | * | * | * | * | * | 3
10 | * | * | * | * | * | * | * | * | * | * | * | 2 | 3
11 | * | * | * | 4 | * | * | * | * | * | * | * | * | 2
12 | * | * | * | * | * | * | * | * | * | * | * | * | 2
Note that the default rule here has class value 2 instead of 1, since in this training set the number of instances with class value 2 dominates the number of instances with class value 1.
References [1] C. Blake, E. Koegh, C.J. Mertz, Repository of Machine Learning, University of California at Irvine, 1999. [2] P.K. Chan, S.J. Stolfo, A comparative evaluation of voting and meta-learning on partitioned data, in: Proceedings of the 12th International Conference on Machine Learning, San Francisco, 1995, pp. 90–98. [3] K.J. Cherkauer, Human Expert Level Performance on a Scientific Image Analysis Task by a System Using Combined Artificial Neural Networks, Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithms (1996) 15–21. [4] T.G. Dietterich, Ensemble methods in machine learning, in: Proceedings of the 1st International Workshop on Multiple Classifier Systems, LNCS vol. 1857, Springer Verlag, 2000, pp. 1–15. [5] U.M. Fayyad, G. Piatetsky Shapiro, P. Smyth, R. Uthurusamy, From data mining to knowledge discovery, Advances in Knowledge Discovery and Data Mining (1996) 1–36. [6] J. Gama, Combining Classification Algorithms, Ph.D. Thesis, University of Porto, 1999. [7] J.W. Grzymala-Busse, LERS—a system for learning from examples based on rough sets intelligent decision support, in: Handbook of Applications and Advances of the Rough Sets Theory, Kluwer, 1992, pp. 3–18. [8] T. Hastie, R. Tibishirani, Classification by Pair Wise Coupling Advances in Neural Information Processing Systems, NIPS97, 10, MIT Press, 1998, pp. 507–513. [9] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, 2004. [10] J. Jelonek, J. Stefanowski, Experiments on solving multi-class learning problems by the n2 classifier, in: Proceedings of 10th European Conference on Machine Learning ECML 98, LNAI vol. 1398, Springer Verlag, 1998, pp. 172–177. [11] W.J.M. Klosgen, Handbook of Data Mining and Knowledge Discovery, Oxford Press, 2002. [12] C. Merz, Combining classifiers using correspondence analysis, Advances in Neural Information Processing Systems 10 (1998) 33–58. [13] R.S. Michalski, I. Bratko, in: M. Kubat (Ed.), Machine Learning and Data Mining, John Wiley & Sons, 1998. [14] W. Jerzy Grzymala-Busse, S. Jerzy, W. Szymon, A comparison of two approaches to data mining from imbalanced data, Journal of Intelligent Manufacturing 16 (2005) 565–573. [15] T.M. Mitchell, Machine Learning, McGraw-Hill, 1997. [16] S. Nowaczyk, J. Stefanowski, Experimental Evaluation of Classifiers Based on Combiner Strategy, Institute of Computing Sciences, Poznan University of Technology, Research Report RA001/02, March 2002. [17] B.K. Sarkar, S.S. Sana, K.S. Choudhury, Accuracy Based, Learning classification system, International Journal of Information and Decision Sciences 2 (1) (2010) 68–85. [18] J.R. Quinlan, C4.5. Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1995. [19] B.K. Sarkar, S.S. Sana, A hybrid approach to design efficient learning classifiers, Journal, Computers and Mathematics with Applications 58 (2009) 65–73. [20] J. Stefanowski, Algorithms of Rule Induction for Knowledge Discovery, Habilitation Thesis Published as Series Rozprawy no. 361, Poznan Univeristy of Technology Press, Poznan, 2001. [21] J. Stefanowski, The bagging and n2 classifiers based on rules induced by MODLE, in: Proceedings of the 4th International Conference Rough Sets and Current Trends in Computing, RSCTC’2004, LNAI vol. 3066, Uppsala, Sweden, Springer Verlag, 2004 June 2004, pp. 488–497. [22] J. Stefanowski, On Combined Classifiers, Rule Induction and Rough Sets, Transactions on Rough Sets, 6, LNAI, vol. 4374, Journal Subline, Springer Verlag, 2007, pp. 329–350. [23] G. 
Valentini, F. Masuli, Ensambles of Learning Machines, Neural Nets WIRN Vietri, 2486, Springer-Verlag, LNCS, 2002, pp. 3–19. [24] WEKA 3.4.6, Data Mining Software in Java, http://www.cs.waikato. ac.nz/ml/weka. [25] S. Pal, H. Biswas, SPID4.7: discretization using successive pseudo deletion at maximum information gain boundary points, in: Proceedings of the Fifth SIAM International Conference on Data Mining, Newport California SIAM, 2005, pp. 546–550. [26] B.K. Sarkar, K. Sachdev, S. Bharati, A. Bhaskar, An interface for converting rules generated by C4.5 to the most suitable format for genetic algorithm, in: Proceedings of the Eighth International Conference on IT (CIT-2005), 20–23 December, Bhubaneswar, India, 2005, pp. 113–115.
[27] E. Bernado-Mansilla, J.M. Garella-Guiu, Accuracy-based learning classifier systems: models analysis and applications to classification tasks, Evolutionary Computation 11 (3) (2003) 209–238. [28] R. Sikora, Learning control strategies for chemical process: a distributed approach, IEEE Export (1992) 35–43. [29] R. Sikora, M. Shaw, A doubled-layered learning approach to acquiring rules for classification: integrating genetic algorithms with similarity-based learning ORSA, Journal on Computing 6 (1996) 174–187. [30] R. Sikora, S. Piramuthu, An intelligent fault diagnosis system for robotic machines, International Journal of Computational Intelligence and Organizations 1 (1996) 144–153. [31] I. Lee, R. Sikora, M. Shaw, A genetic algorithm based approach to flexible flowline scheduling with variable lot sizes, IEEE Transactions on Systems, Man, and Cybernetics 27B (1995) 36–54. [32] J. Koza, Genetic Programming On the programming of Computers by Means of Natural Selection, MIT Press, Cambridge, London, 1992. [33] P.C. Chang, C.H. Liu, Y.W. Wang, A hybrid model by clustering and evolving fuzzy rules for sales decision supports in printed circuit board industry, Decision Support Systems 42 (3) (2006) 1254–1269. [34] C.W.M. Yuen, W.K. Wong, S.Q. Qian, L.K. Chan, E.H.K. Fung, A hybrid model using genetic algorithm and neural network for classifying garment defects, Expert Systems with Applications (2008), doi:10.1016/j.eswa.2007.12.009. [35] K.M. Faraoun, A. Boukleif, Genetic programming approach for multi-category pattern classification applied to network intrusions detections, International Journal of Computational Intelligence 3 (2006) 79–90. [36] W.K. Wong, X.H. Zeng, W.M.R. Au, A decision support tool for apparel coordination through integrating the knowledge-based attribute evaluation expert system and the T–S fuzzy neural network, Expert Systems with Applications, Published Online doi:10.1016/j.eswa.2007.12.068. [37] D.E. Goldberg, Genetic Algorithms in Search Optimization and Machine Learning, New York, Addison Wesley, 1989. [38] J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan, Ann Arbor, 1975.
[39] S.W. Wilson, Classifier fitness based on accuracy, Evolutionary Computation 3 (2) (1995) 149–175. [40] N. Japkowicz, S. Stephen, The class imbalance problem: significance and strategies, in: International Conference on Artificial Intelligence, vol. 1, 2000, pp. 111–117. [41] N. Japkowicz, S. Stephen, The class imbalance problem: a systematic study, Intelligent Data Analysis 6 (5) (2002) 429–450. [42] A. Tanwani, M. Farooq, The role of biomedical dataset in classification, in: Proceedings of AMIE: 12th international Conference on Artificial Intelligence, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 370–374. [43] L. Lei, N. Wu, P. Liu, Applying sensitivity analysis to missing data in classifiers, in: Proceedings of ICSSM’ 05, 2005, pp. 1051–1056, ISBN: 0-7803-8971-9 (IEEEXplore). [44] RSES (Rough Set Exploration System) 2.2: http://logic.mimuw.edu.pl/∼rses, 2005. [45] P. Clark, T. Niblett, The CN2 induction algorithm, Machine Learning 3 (4) (1989) 261–283. [46] B. Boutsinas, G. Antzoulatos, P. Alevizos, A novel classification algorithm based on clustering, in: In the First International Conference From Scientific Computing to Computational Engineering, Athens, Greece, 2004. [47] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11 (1982) 341–356. [48] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht, 1982. [49] L. Hyafil, R. Rivest, Constructing optimal binary decision trees is NP-complete, Information Processing Letters 5 (1) (1976). [50] A. Orriols-Puig, E. Bernadó-Mansilla, Evolutionary rule-based systems for imbalanced data sets, Soft Computing 13 (2009) 213–225. [51] J. Ryan Urbanowicz, J.H. Moore, Learning classifier systems: a complete introduction, review, and roadmap, Journal of Artificial Evolution and Applications (2009), doi:10.1155/2009/736398 (Article ID 736398).