Applied Soft Computing 12 (2012) 238–254. doi:10.1016/j.asoc.2011.08.049
A genetic algorithm-based rule extraction system

Bikash Kanti Sarkar a,∗, Shib Sankar Sana b, Kripasindhu Chaudhuri c

a Department of Information Technology, B.I.T., Mesra, Ranchi 835 215, Jharkhand, India
b Department of Mathematics, Bhangar Mahavidyalaya (C.U.), Bhangar 743 502, 24-Pgs(S), W.B., India
c Department of Mathematics, Jadavpur University, Kolkata 32, India
Article history: Received 15 July 2010; Received in revised form 7 June 2011; Accepted 21 August 2011; Available online 3 September 2011.

Keywords: Classification; Accuracy; C4.5; Genetic algorithm; Hybrid system
Abstract: Individual classifiers are widely used to predict unknown objects. However, they are usually domain specific and do not scale well to data sets that are very large, high-dimensional, or have an imbalanced class distribution. This article introduces an accuracy-based learning system called DTGA (decision tree and genetic algorithm) that aims to improve prediction accuracy over any classification problem irrespective of domain, size, dimensionality and class distribution. More specifically, the proposed system consists of two rule-inducing phases. In the first phase, a base classifier, C4.5 (a decision tree based rule inducer), is used to produce rules from the training data set, whereas a GA (genetic algorithm) in the next phase refines them with the aim of providing more accurate and higher-performance rules for prediction. The system has been compared with competent non-GA based systems: neural network, Naïve Bayes, a rule-based classifier using rough set theory, and C4.5 (i.e., the base classifier of DTGA), on a number of benchmark datasets collected from the UCI (University of California at Irvine) machine learning repository. Empirical results demonstrate that the proposed hybrid approach provides marked improvement in a number of cases. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Over the last decades, one can observe a growing research interest in the fields of machine learning and knowledge discovery [13]. The amount of data stored in databases continues to grow rapidly. Intuitively, this large amount of stored data contains valuable hidden knowledge, which could be used to improve the decision-making process of an organization. For instance, data about previous sales might contain interesting relationships between products and customers, and the discovery of such relationships can be very useful for increasing the sales of a company. However, the number of human data analysts grows at a much smaller rate than the amount of stored data. Thus, there is a clear need for (semi-)automatic methods for extracting knowledge from data, and many data mining algorithms have been developed for extracting knowledge from large databases [5]. In practice, one of the main tasks considered in knowledge discovery is supervised classification, where the learning process is provided with a set of training examples of target concepts (i.e., classes). Specifically, each example is described by a finite number of non-target attributes and a target (class) attribute, and the goal of learning is to discover a rule or a function (in machine
learning, often called a hypothesis) that maps such descriptions into those classes [13,15]. Furthermore, an algorithm that combines a knowledge representation (learned from some training set) with a strategy for its usage forms a classifier, which can be used to predict the classes of newly arriving objects. A typical criterion used to evaluate a classifier's performance is classification accuracy, i.e., the percentage of correctly classified examples. Several single classifiers have been proposed over the years for inducing various knowledge representations (a review is available, for example, in [9,11,13,20]). Although most of these are very effective on particular data sets, they do not always lead to satisfactory classification accuracy in more complex and difficult cases. For instance, theoretical and empirical comparative studies [4,14] have confirmed that there is no single best algorithm for all datasets. One of the main reasons behind such obstacles is the nature of the data set. For clarity, if a data set is linear but a non-linear learning algorithm is applied, or the reverse, then prediction accuracy usually suffers; that is, performance depends mainly on the nature of the data set. Briefly, different algorithms have different strengths and weaknesses, a fact which makes some of them more suitable for certain problems than others. So, to overcome these obstacles, traditional methods have recently been combined among themselves or with GA [37] to tackle classification problems. Such systems are known under several names, such as multiple classifiers, ensembles, committees or classifier fusion [2,4,9,23]. In general, such a
system takes a set of the same or different learning algorithms and applies them, respectively, to different or the same data sets of the same problem in order to induce knowledge. Their predictions are then combined to form an integrated system for making decisions more accurately. A number of algorithms have been designed to induce knowledge based on this strategy; for a review, one may see [3,6–8,10,12,16,20–22]. However, in recent years there has been an increasing interest in the use of evolutionary algorithms to learn knowledge. A short review is presented here. GAs have been successfully applied to learning tasks in different domains such as chemical process control [28], financial classification [29], manufacturing scheduling [31] and robot control [30]. GP (genetic programming, an extension of GA) [32] has been used to evolve a population of fuzzy rule sets. Ester Bernado et al. [27] have provided an accuracy-based GA approach, UCS. Sarkar et al. [17,19] have studied an accuracy-based learning classification system combining C4.5 [18] and GA. In order to forecast the future sales of a printed circuit board factory more accurately, Chang et al. [33] have proposed a hybrid model in which a GA is utilized to optimize the Fuzzy Rule Base (FRB) adopted by the Self-Organizing Map (SOM) neural network. The GA part of the hybrid model [34] has been employed to find an optimal structuring element for classifying garment defect types. Faraoun and Boukelif [35] made an attempt to show the use of a new GP classification approach for performing network intrusion detection. Wong et al. [36] presented a decision support tool combining an expert system and the Takagi-Sugeno Fuzzy Neural Network (TSFNN) for fashion coordination; they have also shown that the GA plays an important role in reducing the coordination rules and the training time for TSFNN. Albert et al. [50] investigate the capabilities of evolutionary on-line rule-based systems, also called learning classifier systems (LCSs), for extracting knowledge from imbalanced data. Finally, Urbanowicz and Moore [51] have given a complete roadmap on LCSs. Looking into past research on machine learning and referring to the article [51], we may comment that most classification systems are designed for solving specific types of problems. Some are specialized to solve multi-class learning problems; a few are designed for class-imbalanced problems. But a learning system concentrating on all these factors together is rarely designed. Again, based on an analysis of the existing GA-based systems, implementation and computation costs are identified as their common issues. Besides, they usually struggle to improve performance on volumetric data because of the growing population size. With these points in mind, a hybrid GA-based learning system (named DTGA) is proposed in this study to improve prediction accuracy over any classification problem irrespective of domain, size, dimensionality and class distribution. The combining approach is very simple and straightforward to implement.
Structurally, the system consists of three phases: (i) the first phase applies the C4.5 rule induction algorithm to extract a base set of IF-THEN rules (R) from a known labeled training data set (say Etrain); (ii) the Interface [26] is used in the next phase to conveniently tackle the interpretability problem of the rules for applying the GA; and finally (iii) the suggested GA is employed on R to filter out informative rules based on the same training set Etrain. More specifically, in the GA phase, the worst rules in R are expected to be replaced with better new rules (in terms of higher accuracy and lower error rate) of classes identical to the respective worst rules. For instance, an existing worst rule of a particular class (say c1) is replaced with a better new rule of class c1. This mechanism specifically aims to improve the classification performance of the system over any data set (even an imbalanced one). In other words, the better new rules generated in the GA phase are expected to survive, removing the worst rules of the respective classes. It is, indeed, based on the steady-state approach of GA. Clearly, the numbers of rules of the respective classes present in the original rule set R for any classification problem remain unchanged
in its optimized version too; only better rules are copied in place of the worst rules to improve the overall classification performance of the rule set. In addition, the proposed system also has the ability to improve performance on volumetric data because it begins with a smaller (i.e., the population size is comparatively small) but informative rule set. Note that past research states that if the population size increases, then performance degrades, since the waiting time for effective crossover becomes too long and there is insufficient juxtaposition of building blocks prior to convergence. Further, a system performs poorly over large data when GA alone is used, since there is little variation in the population for crossover. However, the GA in our system starts with a better population produced by C4.5. Regarding C4.5, it is true that the rules learned by C4.5 are collectively visualized in the form of a decision tree, and their representation is quite precise and simple to interpret in comparison to other learners. Moreover, such rules are comparatively convenient for applying a GA. Also, constructing decision trees is computationally inexpensive, even when the training set size is very large. However, it is difficult to obtain the best decision tree from information gain alone, using such a greedy heuristic strategy. In fact, constructing the best decision tree (in terms of number of nodes) is an NP-complete problem [49]; the defect is due to the searching strategy. Besides, decision tree rule induction algorithms like C4.5 usually discover rules which perform better on the training set than on the test set, due to the presence of deeper nodes in the tree. Actually, the information gained by deeper nodes is nominal and reflects the training set too strongly (known as overfitting). In addition, such an overfitted model certainly requires more storage space and a longer time to execute. Of course, pruning strategies are commonly applied to stop the growth of the tree (i.e., to correct overfitting). But pruning a decision tree certainly causes some examples from the training set to be misclassified. More importantly, the misclassification rate (i.e., error rate) increases especially on imbalanced data sets (i.e., data sets with uneven class distributions). It may be noted here that decision tree based learners are normally biased towards majority classes [40,41]. In fact, the construction strategy of a decision tree prioritizes the number of examples with particular class values, and each time one best attribute is selected as a node of the growing tree based on a probabilistic measure. Consequently, the final constructed decision tree tends to classify all samples into the major classes, and accordingly the imbalance problem seriously affects the performance of the decision tree. Interestingly, the GA has the specific capability to solve the problems suffered by C4.5; it can improve and increase the ability of the classification algorithm. So, to considerably overcome the drawbacks of C4.5, we prefer here to let the GA refine the rule set (R) learned by C4.5 in the subsequent phase while keeping the number of rules unchanged. More clearly, the ultimate goal of this phase is to evolve maximally accurate (i.e., informative) rules and replace the worst ones of R with better new ones to achieve high classification accuracy. After all, the contribution of the proposed sampling technique (as described in Section 2.1) is also not negligible here. Truly speaking, this sampling strategy acts to enrich the population (generated by the base classifier) for the GA.
To show the strength of the proposed learning system, experiments are performed on a number of benchmark data sets [1]. The attributes in the selected data sets contain continuous values, but they are discretized by the SPID4.7 discretizer [25]. Apart from the base learner C4.5 (Decision Tree), the performance of the present system is compared with three other competent learners (each belonging to a distinct family), namely, Neural Network [24] (Artificial Neural Network), Naïve Bayes [24] (Bayesian Network) and a rule-based classifier on rough set theory [44], over the selected data sets. Let us note here that there exist many varieties of neural network. The approach selected here is based on input, hidden and output layers. The input layer has n neurons, one for each input
parameter, whereas the output layer has k neurons, one for each class. The hidden layer has (n + k)/2 neurons, and each neuron uses a sigmoid activation function. Apparently, the idea of the constructed system is quite similar to the methods in [17,19], but the present investigation differs from the other two in the following respects: (i) the number of selected data sets and the splitting technique, (ii) the fitness function and the encoding and decoding techniques adopted in the GA, and finally (iii) the number of classifiers used for comparing performance and the evaluation strategies.

The paper is organized as follows. Section 1 discusses the importance of machine learning algorithms, their drawbacks and recent trends. Section 2 introduces the Learning Classifier System (LCS) [38] and the proposed DTGA approach, including the suggested sampling technique. Section 3 briefly describes each of the other learners used in this study, whereas the experimental design and analysis of the results are discussed in Section 4. Finally, the conclusion is summarized in Section 5.

2. DTGA as LCS

J. Holland presented the first architecture of LCS in 1975. Since then, a number of investigations on its architecture have been performed to solve classification problems. In 1976, it was split into two categories depending upon where the GA acts: (i) Pittsburgh-style LCS, in which a population of separate rule sets is considered and the GA recombines and reproduces the best of these rule sets; and (ii) Michigan-style LCS, which consists of only a single population in which the GA focuses on selecting the best rules within the rule set. UCS (an accuracy-based classifier system) is an example of Michigan-style LCS, specifically designed for the supervised environment. Basically, it keeps the principal features of XCS (a best action map based classifier system) [39] but changes the way in which accuracy is computed. In particular, the GA in XCS is run preferably on the action sets, whereas the GA in UCS is inherited from XCS and performs crossover and mutation, selecting two classifiers from the correct set [C] with probability proportional to fitness. The resulting offspring are then inserted in the population, leaving out the incorrect rules. Structurally, many of the design criteria are optimized in UCS. As per this discussion of LCS styles, we may comment that the present investigation DTGA is also based on Michigan-style LCS and consists of three phases (as already pointed out above). Fig. 1 illustrates the conceptual model of the proposed learning system. The procedures involved in the phases are also discussed in this section. But before discussing them in detail, we must first describe the proposed sampling strategy adopted in the current model.

2.1. Proposed data-splitting technique

Data splitting is an important issue in machine learning. It is a very difficult task to identify how many instances are sufficient to gain knowledge for making good decisions. Further, a data set may be imbalanced, i.e., the majority class of the data set may heavily dominate the minority class; this sort of data is commonly found in the medical domain. Experimentally, the imbalanced nature of a data set is often reported as an obstacle to the induction of good classifiers. Some common data-sampling techniques such as random over-sampling, random under-sampling and the synthetic minority over-sampling technique (SMOTE) are usually followed to handle imbalanced data sets. A brief description of each is given below.
• Random over-sampling is a non-heuristic method that aims to balance class distribution through the random replication of minority class examples.
• Random under-sampling is also a non-heuristic method that aims to balance class distribution through the random elimination of majority class examples.
• Synthetic minority over-sampling technique (SMOTE): SMOTE is an over-sampling method that creates new minority class instances by interpolating several minority class examples that lie nearby in the feature space. It is said that this method avoids overfitting by creating rather than replicating instances of the minority class.

Each of the above groups has some drawbacks: the computational load increases due to over-sampling, whereas under-sampling does not take into account all available training data, which corresponds to a loss of available information. In fact, there is no good solution for such a problem, and so this research problem is still open. However, keeping the imbalance problem of data sets in mind and concentrating on the learning strategy of the base classifier C4.5 of DTGA, we suggest here a new data sampling technique, presented below.

In this approach, approximately 30% of the examples from each data set (say E) are selected as the training set, labeled Etrain, maintaining an almost equal proportion in class distribution over the data. In other words, attention is paid here to include 30% of the examples of each class value in Etrain. For instance, suppose there are 3 class values (say c1, c2 and c3) in a classification problem P with 150 examples in total, and the numbers of examples of class types c1, c2 and c3 are respectively 30, 45 and 75. Then 9, 14 and 23 examples of class types c1, c2 and c3, respectively, are to be included in Etrain by random selection from E. More clearly, 3 examples (each of one distinct class value) are included in Etrain in the first pass. However, as soon as one example is included in Etrain, it is immediately crossed out from E. Similarly, in the next pass, another 3 distinct examples are included in Etrain and then crossed out from E, and so on, until Etrain contains at least 30% of the total. This hold-out strategy is, in fact, sampling without replacement. Let us note here that we apply the ceil function to place 30% of the examples of each class in Etrain, and the inclusion of examples of any class label is stopped immediately when 30% of the examples of that class type are already in Etrain. For instance, (30 × 45)/100 = 13.5, so ceil(13.5) = 14 examples of class c2 (out of 45) are included in Etrain. Finally, when Etrain is constructed, the remainder of E (i.e., E − Etrain) is treated as the test set and labeled Etest. Now, Etrain and Etest are given to the learning algorithms as training set and test set, respectively. However, the contents of Etrain and Etest vary at each iteration, while their sizes remain unchanged. It is, indeed, a sampling technique based on the natural distribution of classes. However, the approach has the capability to control the bias towards majority classes in the training set, since a smaller percentage of each class value is selected. A short illustrative sketch of this splitting procedure is given after the list below. Let us highlight the expected strengths of this sampling technique as follows:

• First of all, such a sampling strategy provides a smaller training set with an almost uniform proportion in class distribution over the data (i.e., no class value is ignored). Now, as per the concepts of information theory, the more uniform the probability distribution, the greater its information.
In other words, if the training set contains examples of all classes in uniform proportion, then it should be much more informative than one that ignores the rare classes. Again, entropy (i.e., degree of doubt) relates to information with a negative sign only, implying that the amount of doubt (impurity) is reduced by gaining/conveying the same amount of information. Hence, looking at the entropy function (see Section 2.2), we expect that an overall balanced rule set (i.e., informative rules covering all classes) will be generated from the training set by the base classifier C4.5 of the proposed system at the very
Fig. 1. Discovering optimized rules by DTGA.
beginning of the investigation. Usually, C4.5 generates rules, each of which covers many training examples. However, if rules covering class values with few training examples are ignored, then classification accuracy degrades significantly; our sampling strategy tries to prevent this. Indirectly, it pays attention to overcoming the imbalance problem in the data set to some extent. More importantly, if the training set covers more than 30% of the entire data set while maintaining a uniform proportion in class distribution, then the probability of minority-class examples being available in the training set decreases due to their rareness in the entire set. This again causes bias towards the majority classes. That is why we have chosen here 30% of the instances for the training set and the remaining 70% for the test set.
• Moreover, such a balanced rule set tends to consist of a small number of rules compared to a rule set generated from a training set containing more than 30% of the examples of the entire data set. Accordingly, it becomes convenient to apply the GA especially on volumetric data sets, since a GA usually struggles to improve accuracy over a large data set due to the large population size.
• After all, such a smaller training set reduces the bias of any learning model towards the known data, making it more reliable on unseen data, since larger training data increases the bias towards the known data.
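The following Python sketch illustrates the class-wise 30–70% splitting procedure of Section 2.1. It is an illustrative reconstruction under our own assumptions, not the authors' code; names such as `stratified_30_70_split` are hypothetical.

```python
import math
import random
from collections import defaultdict

def stratified_30_70_split(examples, class_index=-1, train_fraction=0.30, seed=None):
    """Split 'examples' (a list of attribute-value lists) into (E_train, E_test),
    taking ceil(30%) of the examples of every class value, sampled without
    replacement, so that the class distribution of the data set is preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[class_index]].append(ex)

    train, test = [], []
    for cls, members in by_class.items():
        rng.shuffle(members)                              # random selection within the class
        quota = math.ceil(train_fraction * len(members))  # ceil(30% of this class)
        train.extend(members[:quota])                     # goes to E_train
        test.extend(members[quota:])                      # the rest forms E_test
    return train, test

# Usage with the example from the text: 150 examples, classes c1/c2/c3 with 30/45/75 members.
if __name__ == "__main__":
    data = [["x", "c1"]] * 30 + [["x", "c2"]] * 45 + [["x", "c3"]] * 75
    e_train, e_test = stratified_30_70_split(data, seed=1)
    # Expected quotas: 9, 14 and 23 examples of c1, c2 and c3, i.e. 46 training examples.
    print(len(e_train), len(e_test))   # 46 104
```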
The strengths of the proposed sampling technique are shown in Section 4 and Appendix D.

2.2. C4.5: a decision tree based classifier

A decision tree is a classifier expressed as a recursive partition of the instance space. It consists of nodes that form a rooted tree, meaning it is a directed tree with a node called the "root" that has no incoming edges. All other nodes have exactly one incoming edge but may have zero or more outgoing edges. By definition, a node with outgoing edges is referred to as an internal or test node. In particular, each internal node represents an attribute, and each outgoing edge from an internal node represents one value (known as a test value) of that attribute. The other nodes are called leaves (also known as terminal or decision nodes). Basically, each internal node in a decision tree splits the instance space into two or more sub-spaces according to a certain decision function of the input attribute values, and each leaf is assigned to one class representing the most appropriate target value. Obviously, instances are classified by navigating them from the root (the most informative node) of the tree down to a leaf, according to the outcomes of the tests along the path. Concisely, a decision tree is a tree in which the nodes represent attributes of a data set. The attributes are chosen according to some criterion, such as information gain. The information gain indicates how informative an attribute is with respect to the classification task, using its entropy: the higher the variability of the attribute values, the higher its information gain. The attributes with the higher information gain are chosen to create the next nodes, and the leaves of the tree represent the classes. Hence, a top-down approach is adopted here. C4.5 is a good example of a decision tree based rule induction algorithm that uses entropy computation to determine the most relevant attribute at each node of the tree. In fact, entropy is used to measure how informative a node is, and it is defined as

Entropy(S) = − Σ_{i=1}^{c} pi log2 pi,

where S is the collection of learning examples and pi is the proportion of S belonging to class i among the c classes. In the C4.5 approach, three distinct steps can be identified: (i) constructing a decision tree, (ii) pruning the tree and (iii) rule induction based on the pruned tree. Each step operates at a different level. In the first step, the C4.5u (unpruned) algorithm constructs a decision tree. During this process, the conjuncts of the rule set are chosen one by one, considering the information gain (or gain ratio) of the possible attributes. This recursive partition method begins
with bulk data, because the first branching is based on all the learning material. But it becomes more and more local, as each choice for branching the decision tree depends on a local comparison of the entropy of the remaining learning examples. In the second step, the C4.5p (pruned) algorithm prunes the tree generated by C4.5u. The resulting tree of C4.5u is often very complex and over-fits the data, so the idea is to remove, through C4.5p, the parts of the tree that do not contribute to classification accuracy on new material. Further, error estimates for leaves and sub-trees are computed to remove or combine parts of the tree. Finally, the C4.5r (rule induction) algorithm accepts the decision tree obtained by C4.5p, and each path from the top node to a leaf of the modified tree yields a potential decision rule expressed in simple IF-THEN form. In fact, Quinlan developed the ID3 rule induction algorithm (based on decision trees) in 1986 and the C4.5 induction software package in 1993. C4.5 is derived from ID3 but improves upon it, as it handles continuous as well as missing data and follows pruning strategies to remove the parts of the tree that do not contribute to classification accuracy.

2.3. Interface

In general, the decision rules (i.e., knowledge) induced from examples are represented as logical expressions of the form IF (conditions) THEN (decision class), where the conditions in each rule are conjunctions of elementary tests on values of attributes, and the decision part indicates the assignment of an object (that satisfies the condition part) to a given decision class. Thus, each rule can be viewed as antecedent → consequent, where the antecedent part consists of conjuncts (i.e., conditions) and the consequent is the decision (i.e., action). In the present approach, the C4.5 rule induction algorithm adopted in the first phase extracts rules from the training examples. Each extracted rule is close to the IF-THEN form, as shown through the example of the golf-playing problem presented in Appendix B. Obviously, rules in such an IF-THEN format create an interpretability problem for applying the GA. So, to make them more convenient for the GA, we prefer such rules in a tabular-like form (as shown in the last part of Appendix B). In particular, the Interface performs the task of this representation of the rules, eliminating the IF-THEN part. The discrete values below the names of the attributes in such a tabular form specify their respective values. A '*' symbol appearing in a rule simply indicates the absence of a condition on the attribute corresponding to the position of '*' (i.e., the attribute corresponding to '*' has no importance in that rule). The '*' value for an attribute (say Ai) in a rule is positioned according to Ai's position in the data set. Again, all the non-target attributes, irrespective of their presence or absence in a rule, are strictly retained in every rule with a view to simpler implementation.
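As a minimal illustration of this tabular representation (our own sketch, not code from the paper; the names `Rule`, `covers` and `predicts_correctly` are hypothetical), a rule can be held as a fixed-length list of attribute values with None standing for the '*' don't-care symbol:

```python
from typing import List, Optional

# A rule of the tabular form [1 * 2 1 * 3 | class 0] becomes:
Rule = List[Optional[int]]          # the last position holds the class value
example_rule: Rule = [1, None, 2, 1, None, 3, 0]

def covers(rule: Rule, example: List[int]) -> bool:
    """True if every non-'*' antecedent value of the rule matches the example
    (the class attribute in the last position is ignored here)."""
    return all(r is None or r == e for r, e in zip(rule[:-1], example[:-1]))

def predicts_correctly(rule: Rule, example: List[int]) -> bool:
    """True if the rule covers the example and also assigns the right class."""
    return covers(rule, example) and rule[-1] == example[-1]

# e.g. the rule above covers [1, 5, 2, 1, 9, 3, 0] and classifies it correctly.
```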
2.4. Proposed GA based sub-system

This section begins with a short introduction to genetic algorithms and then discusses in detail each genetic component proposed in this article. Genetic algorithms are popular search algorithms based on the mechanics of natural selection and evolution; they start with an initial population of solutions to a problem. The initial population is then manipulated using various genetic operators to produce a new population, with the aim of optimizing the solution(s) of the problem. The frequently used genetic operators are selection, crossover, mutation, dropping condition, etc. Normally, two parents are chosen randomly from a population, and two new children are generated by subsequently applying crossover, mutation or other operators. In practice, encoding of the parents is done before the crossover operation. There are several ways of encoding parents; for example, one can encode individuals directly as integer/real numbers or as arrays of integers/decimal numbers, with each position representing some particular aspect of the solution. However, solutions are normally encoded by an equivalent binary representation. Among the possible genetic operators, crossover is an important one. It merges the genetic information of two existing individuals (parents), picked by the selection operator, and creates two new individuals (children) called offspring. There are numerous ways to perform the crossover operation, but the simplest is single-point crossover, where two chosen individuals are cut at a randomly selected point within the length of the parents; the tails (the parts after the cutting point) are swapped, leading to two new individuals. In general, if the representation of the individuals is binary (0, 1), then in the mutation operation a zero ('0') is changed to a one ('1') and vice versa. A simple GA treats mutation only as a secondary operator for restoration of genetic material; however, it helps to find the global optimal solution of the problem by searching new areas. Further, the fitness function drives the evolution towards optimization by calculating a fitness score for each individual in the population; this value evaluates the performance of each individual. Concisely, the whole process of evolving from one population to the next is called a generation. The process continues until a predefined termination criterion (for instance, the achievement of a performance target or a certain number of fruitful generations) has been met.

2.4.1. Fitness function

The fitness function is essentially the objective function for the problem. It provides a means of evaluating the search solutions and also controls the selection process. It is well accepted that the fitness function is the only problem-dependent part of a GA. For classification, we can consider factors such as prediction accuracy, error (i.e., misclassification) rate, the imbalanced class problem, etc. Keeping these points in mind, we propose here a new fitness formula:

f(ri) = (n − m)/(n + m) + n/(m + k)     (1)

where ri represents the ith decision rule; n is the number of training examples satisfying all the conditions in the antecedent (A) as well as the consequent (C) of the rule ri (i.e., correct classifications); m is the number of training examples which satisfy all the conditions in the antecedent (A) but not the consequent (C) of the rule ri (i.e., misclassifications); and k is a predefined positive constant. For instance, suppose n = 8 and m = 2 for a rule ri, and assume that k = 4. Then the fitness value of the rule ri is

f(ri) = (8 − 2)/(8 + 2) + 8/(2 + 4) ≈ 1.93
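A hedged Python sketch of Eq. (1) follows. It is our illustration only; the function name `rule_fitness` and the reuse of the list-based rule representation from the sketch in Section 2.3 are assumptions, not part of the paper.

```python
def rule_fitness(rule, train_set, k=4):
    """Fitness of Eq. (1): f = (n - m)/(n + m) + n/(m + k), where n counts training
    examples matched and correctly classified by the rule and m counts examples
    matched by the antecedent but wrongly classified. 'rule' and each example are
    fixed-length lists whose last element is the class; None is the '*' wildcard."""
    n = m = 0
    for ex in train_set:
        if all(r is None or r == e for r, e in zip(rule[:-1], ex[:-1])):
            if rule[-1] == ex[-1]:
                n += 1          # covered and correctly classified
            else:
                m += 1          # covered but misclassified
    if n + m == 0:
        return 0.0              # rule covers nothing on this training set
    return (n - m) / (n + m) + n / (m + k)

# Counts used in the tie-breaking example discussed below (with k = 4):
# r1: n=16, m=1  ->  15/17 + 16/5 = 4.08
# r2: n=20, m=5  ->  15/25 + 20/9 = 2.82
```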
Ideally, this evaluation function reduces collisions among the fitness scores of the rules and ensures the survival of rules with higher classification accuracy but lower error rate, i.e., it removes noisy rules. In fact, the first part of the function plays a major role in satisfying that task. A brief illustration of the presented function is given below to justify its soundness. Suppose that two rules r1 and r2 correctly classify 16 and 20 examples respectively, whereas they misclassify, say, 1 and 5 examples. Clearly, n = 16, m = 1 for r1, whereas n = 20, m = 5 for r2. In such a case, it is very difficult to choose the better of these two rules by the fitness function suggested in [19],
since both rules return an identical score under that evaluation function. However, the present fitness function given in Eq. (1) easily resolves such a tie by yielding fitness scores of 4.08 and 2.82 for r1 and r2, respectively. It can be pointed out here that identical values from the first part of this function for two different rules are rare, unless the values of m and n are multiplied by the same factor or the value of m is zero. If this happens, then the second part of the function plays a vital role in breaking the tie. Note that a positive constant parameter k is introduced here because the case m = 0 may occur. Thus, the present fitness function reduces the chance of collisions in the fitness scores and strongly favours the survival of the appropriate rule on the basis of a higher fitness score. In fact, the two parts of the present evaluation function together fulfil this role. In addition, it does not fail to discover better rules of minority classes, due to the presence of a balanced rule set.

2.4.2. Encoding and decoding strategy

In this phase, the discretized values in the rules (solutions) are used directly. No other representations such as binary coding (and the reverse) are applied here. The main reason behind such a strategy is that the attribute values present in the conjuncts of the rules have been significantly selected by C4.5, and we are not interested in losing these values. Only better new individuals, fit for the test (unseen) data set, are expected to be generated by recombining the conjuncts. Clearly, such a strategy is very simple and less time consuming, as individuals are manipulated directly rather than as bit strings.

2.4.3. Genetic operators

2.4.3.1. Selection and crossover. In our experiment, each rule of a classification problem is considered to be of uniform length, assuming unit length for each attribute. For example, if a classification problem P consists of r attributes (including the target as well as the non-target attributes), then the length (L) of each rule of that problem is considered to be r (since all the non-target attributes of the problem, irrespective of their presence or absence in a rule, are strictly retained in the rule for simpler implementation, and unit length is assigned to each attribute). In this study, the selection of parents for performing crossover is done at random. The rules in the rule set (i.e., population) are numbered 1, 2, . . ., n. Two parents, say p1 and p2, are picked randomly from the current rule set and placed into a mating pool. Next, for the crossover operation, two distinct points (xi, i = 1, 2) are chosen randomly within 1 < xi < L. Finally, the heads and tails (the parts before and after the cutting points, respectively) of the parents are swapped, leading to two new individuals, say O1 and O2.

2.4.4. Proposed genetic-based algorithm

The proposed algorithm begins with an overall balanced rule set R learned by the base classifier C4.5 applied to a set of training examples (Etrain). The training set is, in fact, chosen by the strategy discussed in Section 2.1, and the same training set Etrain is used for our GA too. The GA proceeds by choosing chromosomes (rules) to serve as parents and possibly replacing rules with new chromosomes based on fitness score. Obviously, the quality of each rule ri ∈ R is measured by Eq. (1) (specified in Section 2.4.1).
Also, in order to find the overall fitness score of the new rule set while replacing the worst one(s) of R with the generated children, the approach uses a temporary data structure RT (structurally similar to R). Let us note here that, each time, all the rules in RT are deleted before copying the content of R into RT. Further, the algorithm adopts the coding and crossover techniques illustrated in Sections 2.4.2 and 2.4.3, respectively, and terminates after a specific number of generations. The goal of the algorithm is to replace the worst rule with
the better new rule of the same class as the worst one, in order to improve prediction accuracy. Clearly, rules of minority classes will not be removed from the refined rule set. A brief sketch of the algorithm is outlined below.

Assumptions:
• Input examples are discretized.
• Rules are of uniform length and take the form [1 * 2 1 * 3 0] (the last value is the class attribute's value; '*' is treated as a don't-care value).

Variables:
Max_itr: maximum number of iterations (generations) // a predefined number
no_itr: number of iterations (generations)
RT: rule set

Input: R (rule set generated by C4.5), Etrain (training data set), Max_itr
// f(ri) denotes the fitness score of rule ri computed on Etrain using Eq. (1)
// F(R) denotes the overall fitness of rule set R computed on Etrain using Eq. (2)

begin
  no_itr ← 0 // initially zero iterations (generations)
  RT ← NULL
  Step 1: Randomly select two parents P1 and P2 from R, and place them into a mating pool.
  Step 2: Apply the suggested two-point crossover operation on the selected parents P1 and P2 to generate two new offspring, say O1 and O2; no_itr ← no_itr + 1.
  Step 3: for each existing rule ri ∈ R do the following:
    /* This segment attempts to find a sub/super rule of O1 in R */
    Step 3.1: If the class values of O1 and ri are the same then
      Step 3.1.1: Find the number (m) of pre-conditions matched between O1 and ri.
        If m = min(m1, m2) then /* min(m1, m2) returns the minimum of m1 and m2, where m1 and m2 are respectively the number of exact preconditions present in O1 and ri */
          If m1 < m2 then copy O1 in place of ri, else discard O1 and go to Step 5.
    /* The next part attempts to find a conflicting rule of O1 (if any) in R, and in that case not to include O1 */
    Step 3.2: else // i.e., the class values are not the same
      Step 3.2.1: If the antecedent part (i.e., LHS) of O1 is identical to that of ri, then discard O1 and go to Step 5.
  end for
  // This segment attempts to place O1 in place of some other distinct rule of the same class in R, if possible
  Step 4: Find the lowest fitness-valued rule rw in R having the same class as O1.
    If f(O1) < f(rw) then go to Step 5;
    else compute F(RT) on Etrain by copying the current content of R into RT and replacing rw with offspring O1.
    If F(RT) > F(R) then copy O1 in place of rw in R, else ignore O1.
  Step 5: for each rule ri ∈ R do the following:
    /* This part attempts to find a sub/super rule of O2 in R */
    Step 5.1: If the class values of O2 and ri are the same then
      Step 5.1.1: Find the number (m) of pre-conditions matched between O2 and ri.
        If m = min(m1, m2) then /* where m1 and m2 are respectively the number of exact preconditions present in O2 and ri */
          If m1 < m2 then copy O2 in place of ri, else discard O2 and go to Step 7.
    /* The next part attempts to find a conflicting rule of O2 (if any) in R, and in that case not to include O2 */
    Step 5.2: else // i.e., the class values are not the same
      Step 5.2.1: If the antecedent part (i.e., LHS) of O2 is identical to that of ri, then discard O2 and go to Step 7.
  end for
  // This segment attempts to place O2 in place of some other distinct rule of the same class in R, if possible
  Step 6: Find the lowest fitness-valued rule rw in R having the same class as O2.
    If f(O2) < f(rw) then go to Step 7;
    else compute F(RT) on Etrain by copying the current content of R into RT and replacing rw with offspring O2.
    If F(RT) > F(R) then copy O2 in place of rw in R, else ignore O2.
  Step 7: If the desired number of iterations (generations) is not completed (i.e., no_itr < Max_itr), then delete the parents from the mating pool and go to Step 1.
end

Output: optimized version of the rule set R (in discretized form).
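The following Python sketch (an illustrative reconstruction under our own assumptions, not the authors' implementation) puts the two-point crossover of Section 2.4.3.1 and the replacement test of Steps 4 and 6 together; the sub/super-rule and conflict checks of Steps 3 and 5 are omitted for brevity. `fitness_fn` plays the role of Eq. (1) and `overall_fn` the role of F(R) from Eq. (2); rules are fixed-length lists whose last element is the class.

```python
import random

def two_point_crossover(p1, p2, rng=random):
    """Two-point crossover on fixed-length rules (Section 2.4.3.1): two distinct cut
    points strictly inside the rule are chosen, and the heads and tails of the two
    parents are swapped while the middle segments stay in place. Assumes len >= 3."""
    L = len(p1)
    x1, x2 = sorted(rng.sample(range(1, L), 2))
    o1 = p2[:x1] + p1[x1:x2] + p2[x2:]
    o2 = p1[:x1] + p2[x1:x2] + p1[x2:]
    return o1, o2

def try_replace_worst(R, offspring, fitness_fn, overall_fn):
    """Steps 4/6: find the lowest-fitness rule of the offspring's class; replace it
    only if the offspring is at least as fit and the overall training accuracy of
    the tentative rule set R_T improves. Returns True if the replacement happened."""
    same_class = [i for i, r in enumerate(R) if r[-1] == offspring[-1]]
    if not same_class:
        return False
    w = min(same_class, key=lambda i: fitness_fn(R[i]))   # index of the worst rule r_w
    if fitness_fn(offspring) < fitness_fn(R[w]):
        return False
    RT = list(R)
    RT[w] = offspring                                     # tentative rule set R_T
    if overall_fn(RT) > overall_fn(R):
        R[w] = offspring                                  # accept the replacement
        return True
    return False

# One generation (Steps 1, 2, 4 and 6), given fitness functions f and F:
#   o1, o2 = two_point_crossover(*random.sample(R, 2))
#   try_replace_worst(R, o1, f, F); try_replace_worst(R, o2, f, F)
```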
Now, to identify distinct or identical rules, we consider only the preconditions of a rule with numerical values (i.e., other than '*'). For clarity, some examples are given below. We first take the case of distinct rules. For instance, each of the rules r1: 4 2 * 1 c1 (class) and r2: 4 2 * 2 c1 (class) has 3 preconditions, since there are 3 preconditions with numeric values in each of r1 and r2. However, r1 and r2 are not identical to each other, because the fourth (from the left) precondition values of r1 and r2 are not the same (they are 1 and 2, respectively), although the first two preconditions of each match exactly. Obviously, the number of preconditions matched between r1 and r2 is here 2, which is not equal to min(3,3) = 3. Hence, r1 and r2 are distinct. Next consider the case of identical rules. The two rules r1: 4 2 * 1 1 (class) and r2: 4 * * 1 1 (class) may be treated as identical because the numbers of preconditions present in r1 and r2 are 3 and 2, respectively, and they match at two places. Clearly, the number of matched preconditions is here 2, and min(3,2) also returns 2, which equals that number. Hence, the two rules are identical, but one may supersede the other; in other words, of these two rules, one is the super rule of the other. Obviously, the number of preconditions of r2 is 2, which is less than that of r1 (i.e., 3). So r2 is here treated as the super rule of r1, and r2 (instead of r1) is preferred in the rule set, with the aim of classifying more test examples. Further, the overall fitness F(R) is basically the overall classification accuracy of the rule set R on the training examples, and it is defined as follows:

F(R) = (Number of training examples correctly classified by the rule set R / Total number of training examples) × 100     (2)

Actually, during the genetic evolution of rules, it may also happen that a new rule covers examples which are already classified by an existing rule other than the worst rule of the same class as the new one. In such a situation, the fitness score of the new rule may be higher than that of the worst one. If this is the case, then the overall classification accuracy of the rule set may decrease if the new rule takes the place of the worst one. With this point in mind, this function is introduced so as not to include such a new redundant rule, at any cost, in place of the existing worst rule. From the time-complexity point of view, each of Steps 3–7 of the algorithm takes O(n) running time in the worst case for each generation, assuming n rules are present in the rule set R.

3. Other classifiers

To compare the strength of the proposed system DTGA, we have used here three other competent learners (each belonging to a
distinct family) namely, Neural Network (Artificial Neural Network), Naïve Bayes (Bayesian Network) and Rule-based classifier on rough set theory apart from the base classifier C4.5 (Decision Tree). A brief study on each of these is given below.
3.1. Neural network

Neural networks (NNs) are often referred to as artificial neural networks to distinguish them from biological neural networks. A neural network can be viewed as a directed graph with source (input), sink (output) and internal (hidden) nodes. The input nodes form an input layer and the output nodes an output layer, while the hidden nodes occur in one or more hidden layers. Solving a classification problem using NNs involves the following basic steps:

• Determining the number of input nodes, output nodes and hidden layers.
• Assigning weights (labels) and activation functions to be used in the graph.
• For each tuple in the training set, propagating it through the network and comparing the output prediction with the actual result. If the prediction is accurate, the weights are adjusted to ensure that this prediction has a higher output weight; if the prediction is not correct, the weights are adjusted to provide a lower output value for this class.

It is observed that, by learning from previous experience, a neural network acts as an excellent tool for prediction. However, a disadvantage is that it is difficult to design neural network architectures (mainly the activation functions and trained parameters), and the learned knowledge is not shown explicitly.
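As a small, hedged illustration of the layer sizing used in this study (n input neurons, (n + k)/2 hidden sigmoid neurons, k output neurons), the numpy sketch below shows a forward pass for such a network; it is our own sketch, not the WEKA configuration actually used, and the function names are hypothetical.

```python
import numpy as np

def build_mlp(n_inputs, n_classes, seed=0):
    """Create weight matrices for a single-hidden-layer network with
    (n_inputs + n_classes)//2 hidden units, as described in the text."""
    rng = np.random.default_rng(seed)
    n_hidden = (n_inputs + n_classes) // 2
    W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
    return W1, W2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward pass: sigmoid units in both the hidden and the output layer."""
    h = sigmoid(x @ W1)
    return sigmoid(h @ W2)     # one output per class; the largest is the prediction

# e.g. for the Iris data set: n = 4 inputs, k = 3 classes, hence 3 hidden neurons.
W1, W2 = build_mlp(4, 3)
print(forward(np.zeros(4), W1, W2).shape)   # (3,)
```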
3.1.1. The Naive Bayesian classifier

The Naive Bayes classification technique is based on the Bayes theorem and is particularly suited to cases where the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods. For classification, the approach calculates the probabilities of the different classes given some observed evidence. It is called naïve because it assumes that all attributes are independent of each other. A brief description of this method is given below. Given a data value xi, the probability that a related tuple (i.e., example) ti is in class Cj is described by the probability p(Cj | xi). Training data can be used to determine p(xi), p(xi | Cj) and p(Cj); all these are prior probabilities based on previous experience. From these known probabilities, the Bayes theorem allows us to estimate the posterior probability p(Cj | xi) and then p(Cj | ti). A form of the Bayes theorem is

p(X | Y) = p(Y | X) p(X) / p(Y)

where p(X) is the prior probability and p(X | Y) is the posterior probability. Given a training set, the Naïve Bayes algorithm first estimates the prior probability p(Cj) for each class by counting the total number of examples of each class in the training set. Again, for each attribute xi, the number of occurrences of each attribute value xi can be counted to determine p(xi). Similarly, the probability p(xi | Cj) can be estimated by counting how often each value occurs in the class in the training data set. It is true that a tuple in the training data may have many different attributes, each with many values. Suppose that ti (a new tuple) has q independent attribute values {xi1, xi2, . . ., xiq}. From the descriptive phase, we know p(xiq | Cj) for
each class Cj and attribute xiq. We then estimate p(ti | Cj) as follows:

p(ti | Cj) = ∏_{k=1}^{q} p(xik | Cj)
Now, using the known probabilities to calculate p(ti), we can estimate the likelihood that ti is in each class. The posterior probability p(Cj | ti) is then found for each class, and the class with the highest probability is the one chosen for the tuple. Precisely, the Bayesian classifier is easy to use and understand, and it can easily handle missing values by simply omitting the corresponding probability when calculating the likelihood of membership in each class. But it may not always give satisfactory results, as the attributes are usually not independent and the technique does not handle continuous data. Dividing the continuous values into ranges can be used to address this problem, but dividing the domain into ranges is not an easy task.

3.2. Rule-based classifier and rough set approach

The basic principle of any rule-based classifier is to generate IF (conditions) THEN (decision class) rules, where the conditions of each rule are simply conjunctions of elementary tests on the values of attributes and the decision part indicates the assignment of an object (that satisfies the condition part). The aim is to cover all the training cases, which is why these techniques are sometimes called covering techniques. In this context, we may point out that a decision tree based rule induction algorithm can always be used to generate rules of such a form, but the rules induced by a rule-based classifier may not necessarily build a tree, because the nodes in a tree are ordered whereas the rules generated by a rule-based classifier have no order. Some pure rule-based examples are CN2 [45] and CL2 [46]. The rough set (RS) approach (derived from set theory) also has the capability to generate decision rules in IF-THEN form. This theory was first introduced by the Polish computer scientist Pawlak [47,48] during the early 1980s. In this theory, data are collected in a table called a decision table, where the rows correspond to objects and the columns to features. Usually, decision tables are difficult to analyze, since they store a huge quantity of data, which is hard to manage from a computational point of view; moreover, some facts in the table may not be consistent with each other. So, it is necessary to reduce the size of the data, and one of the main objectives of RS data analysis is to do so. Interestingly, this approach has the ability to remove redundant or inconsistent information from the data table and to find minimal sets of attributes called reducts (i.e., to reduce dimensionality). A rough set classification model can be simply partitioned into three distinct phases: (i) Pre-processing phase: this phase includes tasks such as extra variable addition and computation, decision class assignment, data cleansing, completeness, correctness, attribute creation, attribute selection and discretization. (ii) Analysis and rule generating phase: in this phase, the generation of preliminary knowledge, such as the computation of object reducts from data, the derivation of rules from reducts, and rule evaluation and prediction processes, is taken into account. (iii) Classification phase: this phase utilizes the rules generated in the previous phase to predict the unseen data. Obviously, such an approach is easy to understand and offers a straightforward interpretation of the obtained results. On the basis of a case study of rough set theory, we notice that, at the beginning, rough set theory was mainly used to preprocess data and to classify objects. Therefore, its community has
concentrated on constructing efficient algorithms for extracting rules. But recently it has often been used within classification algorithms, i.e., combining the merits of rough sets with those of other classifiers. Several efficient methods for creating classifiers have been introduced; for a review see, e.g., [11,15]. These classifiers are often constructed using a search strategy that optimizes criteria strongly related to predictive performance (which is not directly present in the original rough set theory formulation). Several authors have developed their own approaches to construct decision rules from rough approximations of decision classes, which, joined together with classification strategies, led to good classifiers; see, e.g., [7,22]. In a nutshell, the complexity of a classifier can be reduced and its classification performance improved using rough set theory, but the problem of finding a minimum-length reduct is NP-hard, so heuristics are used in searching for short reducts.

4. Experimental design and results

4.1. Experimental study

This subsection describes the details of the experiments, including the data sets, the algorithms used for comparison and the settings of the tests. In particular, three sets of experiments are conducted in this study, namely, an experiment on the proposed sampling technique, an experiment on the performance of the proposed learning classifier and, finally, an experiment on the sensitivity analysis of the learners. Five learning algorithms in total are used in our investigation. Note that C4.5 [18] is downloaded software, whereas the presented GA-based algorithm is implemented in Java 1.4.1 on a Pentium 4 running Mandrake Linux OS 9.1. Regarding the other learners, the Naïve Bayes and Neural Network implementations are from the WEKA (Waikato Environment for Knowledge Analysis) 3.4.2 framework, whereas the rule-based classifier on rough sets is from RSES (Rough Set Exploration System) 2.2. All experiments are performed on the same machine. In the context of DTGA, parents and crossover sites are selected randomly with a different random seed each time; the suggested crossover is two-point. Also, the predefined number of generations for the GA is chosen as 80. The algorithms are tested on 18 benchmark data sets of real-world problems drawn from the UCI machine learning repository. Table 1 gathers the relevant features of the problems, arranged in alphabetical order of their names. More importantly, the last three columns of this table are chosen to show the class imbalance of the data sets. The imbalance ratio of each data set is computed using the formula proposed by Tanwani and Farooq [42] and shown in the 5th column of Table 1. It is restated here as Eq. (3):

Imbalance ratio (Ir) = ((Nc − 1)/Nc) Σ_{i=1}^{Nc} Ii/(In − Ii)     (3)

where Ii denotes the number of instances of the ith class; In, the total number of instances; and Nc, the number of classes present in the data set. The value of Ir lies in the range 1 ≤ Ir < ∞, and Ir = 1 implies that the data set is completely balanced, having equal numbers of instances of all classes (a small computational sketch of Eq. (3) is given after Table 1 below). Let us note here that the SPID4.7 discretizer converts each original data set into discrete form and handles missing values suitably. Also, it slightly modifies the number of instances and relevant attributes whenever required. For example, the original Annealing data set has 38 non-categorical attributes in total, but SPID4.7 has reduced this number to 20, since 18 attributes in the original data set contain only missing values throughout the set. Similarly, the 5th attribute in the Heart (Swiss) data remains constant for all instances and is therefore deleted by SPID4.7.
Table 1
Summary of the UCI data sets.

Problem name      | Number of non-target attributes | Number of classes | Number of examples | Imbalance ratio | % of minority class with minimum instances | % of majority class with majority instances
Adult             | 14 | 2  | 48,842 | 1.7468 | 23.91 | 76.07
Annealing         | 38 | 6  | 798    | 2.7679 | 1.00  | 76.00
Credit            | 15 | 2  | 690    | 1.0246 | 44.49 | 55.50
Dermatology       | 34 | 6  | 366    | 1.0526 | 5.4   | 30.60
Ecoli             | 8  | 8  | 336    | 1.2495 | 0.5   | 42.55
Glass             | 10 | 7  | 214    | 1.1571 | 4.2   | 35.51
Heart (Hungarian) | 13 | 5  | 294    | 1.7389 | 5.1   | 63.94
Heart (Swiss)     | 13 | 5  | 123    | 1.1409 | 4.06  | 39.02
Iris              | 4  | 3  | 150    | 1.0000 | 33.33 | 33.33
Liver disorder    | 6  | 2  | 345    | 1.0522 | 42.02 | 57.98
New-thyroid       | 5  | 3  | 215    | 1.7673 | 13.95 | 69.76
Pima Indian       | 8  | 2  | 768    | 1.2008 | 34.89 | 65.11
Segment           | 19 | 7  | 2310   | 1.0000 | 14.28 | 14.28
Soyabean-large    | 35 | 19 | 307    | 1.0402 | 0.32  | 13.02
Tae               | 5  | 3  | 151    | 1.0005 | 33.33 | 33.33
Wine              | 13 | 3  | 178    | 1.0191 | 26.96 | 39.88
Wisconsin         | 10 | 2  | 699    | 1.2133 | 34.47 | 65.52
Yeast             | 9  | 10 | 1486   | 1.1803 | 0.33  | 31.19
Note: For more details see Appendix A.
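As referenced after Eq. (3), the imbalance ratio can be computed as in the short Python sketch below. It is our own illustration (not code from the paper), reusing the 3-class example of Section 2.1 rather than the actual class counts of Appendix A.

```python
def imbalance_ratio(class_counts):
    """Imbalance ratio Ir of Eq. (3): ((Nc - 1)/Nc) * sum_i Ii / (In - Ii).
    Ir = 1 for a perfectly balanced data set and grows with the imbalance."""
    n_c = len(class_counts)
    i_n = sum(class_counts)
    return (n_c - 1) / n_c * sum(i_i / (i_n - i_i) for i_i in class_counts)

# Perfectly balanced: an Iris-like 3 x 50 instances -> Ir = 1.0
print(round(imbalance_ratio([50, 50, 50]), 4))    # 1.0
# The 3-class example of Section 2.1 (30/45/75 instances) -> Ir ≈ 1.119
print(round(imbalance_ratio([30, 45, 75]), 4))    # 1.119
```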
Interestingly, one may refer to Appendix A for more details of the selected data sets, including their class distributions. From Table 1, it is quite obvious that the selected data sets come from different domains. Again, they vary greatly in terms of number of classes, number of features and number of instances: the number of classes ranges up to 19, the number of features ranges from 4 to 38 and the number of instances ranges from 123 to 48,842. The imbalance ratio varies from 1 to 2.7679. Further, looking at the imbalance ratios of the data sets as well as their proportions of minority classes with minimum instances (%) and majority classes with maximum instances (%), we may comment that the selected problems are a mix of balanced and imbalanced problems, although some problems, namely Adult, Annealing, Ecoli, Heart (Hungarian), Heart (Swiss), New-thyroid and Pima Indian, are comparatively imbalanced. Note that, as a case study for evaluating the proposed GA on volumetric data, we have used the Adult data set, which is one of the largest public-domain data sets in the well-known UCI data repository. The details of the data set are described in Appendix A.

4.2. Experiment on the proposed sampling technique

To verify the strength of our proposed sampling technique, we have taken 9 data sets out of the selected 18 data sets (shown in Table 1). These are listed separately in Table 2. Note that, of the nine problems, the first seven are comparatively imbalanced and the last two are perfectly balanced. Now, we conduct three sub-experiments using the C4.5 learner, namely e1, e2 and e3, on these nine data sets, where each of e1, e2 and e3 consists of 10 runs.

4.3. Experiment e1 on random (30–70%) sampling

At each run of e1, we randomly pick 30% of the examples of each data set as the training set and the remainder as the test set, without considering the class distribution. Then C4.5 is run on this training set and the trained model is tested on the corresponding test set. The accuracy result of each run is recorded.

4.4. Experiment e2 on equi-distributed (30–70%) sampling

On the other hand, at each run of e2, we follow our suggested splitting approach (i.e., 30–70%, discussed in Section 2.1) to separate the training and test sets. Next, C4.5 is trained on the training
set and the induced knowledge is tested on the test set. Here too, the accuracy result at each run is recorded.

4.5. Experiment e3 on equi-distributed (40-60%) sampling

Of further interest, we follow our suggested splitting approach to select 40% of the examples of each class over the entire data set of each problem as the training set, and the remaining 60% as the test set. Then we apply the C4.5 learner as usual on the training set and compute accuracy on the respective test set. Let us remind here that accuracy (acc) is always calculated in this study using Eq. (4). Finally, the 10 results for each data set found from each sub-experiment are averaged and shown in Table 2. In addition, a standard deviation (s.d.) is calculated from the 10 results and displayed in the table.

4.6. Experiment for the performance analysis of the proposed learning classifier

To obtain a good estimate, all the classifiers are run 20 times, each time on a distinct training set and test set of each of the 18 data sets. Note that each training set and test set for each data set is selected by our proposed data-splitting strategy discussed in Section 2.1. Next, each classifier is trained on the training set, and then the trained model is run on the test set to measure accuracy. The training and test sets decided for a data set at a particular run are used by all the classifiers for that run only. In other words, at every run, two distinct sets (one training set and one test set) are first decided for each data set following the suggested data-sampling approach, and then each classifier is trained and the induced knowledge is tested. Note that 80 generations are produced in each run of the GA. The accuracies achieved on each data set by the individual classifiers are averaged over all 20 results. In addition, a standard deviation is reported along with each mean result; the standard deviation is important, since it summarizes the variability of a classifier's performance. Table 3 summarizes the performance of the suggested learners on the discretized data sets. For measuring the performance of a classifier at each run, simply the classification accuracy of the rule set is considered here. It is defined as

Accuracy = (number of test examples correctly classified by the rule set / total number of test examples) × 100    (4)
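The equi-distributed sampling of Section 2.1 and the accuracy measure of Eq. (4) can be summarized with the following Python sketch. It is an illustration only (the function names are ours, not part of the published implementation), and it assumes that the per-class training fraction is rounded up with the ceiling operator, as noted in Appendix D.

import math
import random

def equi_distributed_split(examples, labels, train_fraction=0.3, seed=None):
    # Draw (roughly) the same fraction of every class into the training set;
    # the ceiling operator is applied per class, as noted in Appendix D.
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        k = math.ceil(train_fraction * len(indices))
        train_idx.extend(indices[:k])
        test_idx.extend(indices[k:])
    train = ([examples[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([examples[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

def accuracy(predicted, actual):
    # Classification accuracy of Eq. (4), expressed in percent.
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)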
In addition, to understand the main difference between the replaced rules and the original rules, an example with an original rule set (learned by C4.5) as well as the optimized rule set (i.e., refined by DTGA) from one run is included in Appendix C.
Table 2. Comparative performance of the proposed and the usual data-splitting techniques on the UCI data sets: accuracy rate (%) of the following approaches (acc ± s.d.).

Problem name | Random (30-70%) data sampling and C4.5 | Equi-distributed (30-70%) data sampling and C4.5 | Equi-distributed (40-60%) data sampling and C4.5
Adult | 77.08 ± 2.35 | 82.21 ± 1.46 | 83.45 ± 1.21
Annealing | 88.15 ± 3.92 | 92.02 ± 2.52 | 91.83 ± 2.21
Ecoli | 79.15 ± 5.92 | 84.12 ± 4.64 | 79.43 ± 5.49
Heart (Hungarian) | 63.16 ± 6.01 | 69.28 ± 5.26 | 71.28 ± 4.34
Heart (Swiss) | 40.25 ± 7.04 | 44.38 ± 6.46 | 44.21 ± 6.16
New-thyroid | 90.11 ± 3.77 | 94.56 ± 3.26 | 93.12 ± 3.52
Pima Indian | 73.10 ± 3.08 | 78.08 ± 1.98 | 79.32 ± 1.61
Iris | 93.02 ± 3.76 | 95.78 ± 3.35 | 97.25 ± 3.04
Tae | 52.02 ± 5.65 | 55.86 ± 5.02 | 57.35 ± 4.82

4.7. Sensitivity analysis

In real-world data sets there are many data-quality problems that affect the prediction of a classifier, and missing data is quite a common one. However, the capability of handling such problems varies from classifier to classifier. In this respect, sensitivity analysis helps to determine the dependency of the model on the structure and hypothesis of the environment: if a tiny change of an input leads to great changes of the output, the model is said to be highly sensitive to that input. In this view, another experiment is carried out on all the suggested learners as follows. First, we separately select a training set (labelled E_train) and a test set (labelled E_test) for each problem, in addition to the 20th iteration of the preceding experiment, following the suggested data sampling. Then the knowledge induced by each learner on E_train is tested on E_test, and the overall accuracy is recorded. Next, for the purpose of checking the sensitivity of the learners, we follow the idea proposed by Lei et al. [43], who analyzed the changes in classification accuracy as the proportion of missing data in the data set is varied. Based on their observation, a classification algorithm is sensitive to missing data if the change is not considerably small. Statistically, if the proportion of missing values in the data set exceeds 20%, then there
is an obvious decrease in the classification accuracy. That is why we have intentionally introduced missing values (such as '?') into 25% of the examples of the test set E_test of each problem. Then, each of the suggested classifiers is simply run on the corrupted test set to measure the overall classification accuracy (t2) again. Finally, in Table 4, the decrement (Δ = t1 − t2) of the accuracy achieved by each classifier on each problem is shown within parentheses along with the overall accuracy (t1) obtained from the uncorrupted test set E_test.

4.8. Results and analysis

In this section, we provide the results of the experiments and discuss the performance of the sampling approach and of the learning algorithms over the data sets. The approach is implemented in Java 1.4.1. The results of Table 2 show that the proposed equi-distributed (30-70%) sampling yields better learning performance than random (30-70%) sampling on every data set, and that it performs comparably to or better than equi-distributed (40-60%) sampling, particularly when the data sets are imbalanced. In addition, Appendix D is included to provide further evidence of its strength. Based on the results of Table 2 and the evidence shown in Appendix D, we may reasonably claim that the suggested sampling technique is a useful pre-requisite for the learning model, although we cannot ensure that it is the optimal solution in this respect.
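The corruption step described in Section 4.7 can be sketched in Python as follows. How the '?' marks are distributed within a selected example is our assumption (the text only states that 25% of the test examples are affected), and model.predict in the trailing comment is a placeholder for whichever trained classifier is being evaluated.

import copy
import random

def corrupt_with_missing(test_examples, fraction=0.25, seed=0):
    # Return a copy of the test set in which 'fraction' of the examples carry
    # a missing value '?'.  Placement of the '?' is an assumption of this
    # sketch: one randomly chosen attribute of each selected example is blanked.
    rng = random.Random(seed)
    corrupted = copy.deepcopy(test_examples)
    n_corrupt = int(round(fraction * len(corrupted)))
    for idx in rng.sample(range(len(corrupted)), n_corrupt):
        attr = rng.randrange(len(corrupted[idx]))
        corrupted[idx][attr] = '?'
    return corrupted

# Sensitivity as reported in Table 4 (model.predict is a placeholder):
#   t1 = accuracy(model.predict(e_test), y_test)
#   t2 = accuracy(model.predict(corrupt_with_missing(e_test)), y_test)
#   delta = t1 - t2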
Table 3. Comparative performances of NN, Naïve Bayes, C4.5, DTGA and the rule-based classifier using rough set on the UCI data sets: accuracy rate (%) of the following approaches (acc = average accuracy, s.d. = standard deviation).

Problem name | NN (acc ± s.d.) | Naïve Bayes (acc ± s.d.) | C4.5 (acc ± s.d.) | DTGA (C4.5 + GA) (acc ± s.d.) | Rule-based classifier using rough set (acc ± s.d.)
Adult | 73.23 ± 2.43 | 79.20 ± 2.07 | 81.28 ± 1.94 | 85.08 ± 1.04 | 82.26 ± 1.12
Annealing | 92.17 ± 2.45 | 85.61 ± 3.89 | 92.46 ± 2.37 | 95.05 ± 1.94 | 90.84 ± 2.12
Credit | 82.91 ± 2.39 | 84.55 ± 2.05 | 82.15 ± 2.43 | 89.67 ± 1.43 | 83.27 ± 2.58
Dermatology | 94.65 ± 3.12 | 95.14 ± 2.89 | 94.78 ± 3.07 | 97.61 ± 2.45 | 96.12 ± 2.34
Ecoli | 85.65 ± 4.32 | 86.21 ± 3.82 | 85.72 ± 4.12 | 90.86 ± 3.46 | 82.78 ± 4.96
Glass | 69.94 ± 7.74 | 70.38 ± 7.08 | 72.85 ± 7.24 | 79.37 ± 4.59 | 71.87 ± 7.69
Heart (Hungarian) | 69.43 ± 6.14 | 73.03 ± 5.17 | 71.81 ± 5.81 | 78.08 ± 4.63 | 60.82 ± 6.34
Heart (Swiss) | 43.02 ± 5.12 | 40.73 ± 5.29 | 43.87 ± 6.92 | 52.82 ± 3.62 | 39.78 ± 7.12
Iris | 97.22 ± 2.38 | 96.17 ± 2.61 | 96.64 ± 3.21 | 98.02 ± 1.82 | 95.37 ± 2.87
Liver disorder | 72.71 ± 3.61 | 76.06 ± 2.97 | 74.86 ± 3.08 | 80.02 ± 1.85 | 76.07 ± 2.26
New-thyroid | 94.16 ± 3.38 | 95.02 ± 3.08 | 94.46 ± 3.07 | 97.82 ± 2.35 | 91.33 ± 3.65
Pima Indian | 77.01 ± 2.35 | 78.63 ± 2.05 | 77.28 ± 2.26 | 84.92 ± 1.37 | 79.81 ± 1.91
Segment | 95.13 ± 2.96 | 89.21 ± 4.38 | 94.47 ± 3.18 | 97.05 ± 2.56 | 94.56 ± 2.95
Soyabean-large | 86.34 ± 4.92 | 87.94 ± 4.56 | 87.89 ± 4.32 | 91.47 ± 2.58 | 86.31 ± 3.97
Tae | 57.02 ± 5.86 | 58.94 ± 4.22 | 57.97 ± 5.38 | 61.98 ± 3.14 | 52.13 ± 5.31
Wine | 97.41 ± 2.27 | 97.62 ± 2.38 | 97.12 ± 2.69 | 98.11 ± 1.37 | 96.78 ± 1.94
Wisconsin | 94.32 ± 2.38 | 95.06 ± 1.98 | 94.91 ± 2.17 | 98.07 ± 1.13 | 97.55 ± 1.22
Yeast | 51.34 ± 2.72 | 55.88 ± 3.87 | 53.98 ± 4.07 | 59.23 ± 1.97 | 47.25 ± 4.72
Table 4. Performance of the five classifiers on the corrupted test data sets [t1 = accuracy (%) over E_test (without missing data), t2 = accuracy (%) over corrupted E_test (i.e., 25% missing data in E_test), Δ = t1 − t2].

Problem name | NN (t1, Δ) | Naïve Bayes (t1, Δ) | C4.5 (t1, Δ) | DTGA (C4.5 + GA) (t1, Δ) | Rule-based classifier using rough set (t1, Δ)
Adult | 69.18 (4.06) | 78.34 (1.79) | 80.56 (2.06) | 84.09 (1.16) | 81.07 (2.86)
Annealing | 90.72 (4.63) | 83.40 (3.16) | 92.55 (2.44) | 95.21 (2.03) | 87.67 (3.05)
Credit | 83.87 (3.57) | 85.20 (2.37) | 82.14 (2.92) | 88.62 (1.97) | 83.20 (1.68)
Dermatology | 95.11 (0.16) | 96.41 (1.19) | 95.58 (1.67) | 97.38 (0.96) | 95.79 (1.02)
Ecoli | 85.14 (3.74) | 85.62 (3.34) | 85.31 (4.21) | 91.34 (3.78) | 80.73 (2.72)
Glass | 70.43 (4.34) | 70.67 (3.89) | 71.56 (5.82) | 79.21 (3.92) | 71.13 (3.43)
Heart (Hungarian) | 69.31 (0.96) | 71.82 (0.00) | 70.87 (2.44) | 77.21 (0.43) | 59.22 (0.89)
Heart (Swiss) | 42.76 (1.43) | 40.38 (1.37) | 43.96 (1.25) | 52.15 (0.00) | 40.09 (0.92)
Iris | 96.13 (2.06) | 94.08 (2.41) | 94.22 (2.71) | 97.08 (1.94) | 94.73 (2.18)
Liver disorder | 72.17 (2.59) | 75.12 (2.42) | 73.85 (3.03) | 79.83 (2.25) | 75.04 (2.82)
New-thyroid | 94.35 (3.97) | 96.02 (2.21) | 95.83 (1.46) | 97.44 (1.04) | 90.82 (2.38)
Pima Indian | 76.21 (2.32) | 78.13 (1.59) | 77.26 (2.30) | 83.76 (1.78) | 77.04 (1.94)
Segment | 94.12 (2.82) | 89.79 (1.37) | 93.77 (1.68) | 97.47 (1.19) | 94.73 (1.25)
Soyabean-large | 86.07 (1.46) | 86.92 (0.83) | 86.76 (2.65) | 91.61 (2.08) | 85.73 (1.13)
Tae | 56.92 (1.49) | 57.83 (2.76) | 56.12 (2.83) | 60.53 (2.97) | 49.13 (2.32)
Wine | 95.14 (1.42) | 96.17 (0.67) | 95.46 (1.22) | 97.71 (1.07) | 95.92 (1.04)
Wisconsin | 95.92 (1.10) | 96.11 (0.73) | 94.16 (0.89) | 96.74 (0.78) | 96.41 (0.72)
Yeast | 52.12 (2.03) | 55.87 (0.58) | 54.71 (2.62) | 60.13 (1.58) | 46.77 (2.49)
On the basis of the results of Tables 3 and 4, we summarize the strong points of our proposed system as follows:

• First of all, looking at Table 3, it is obvious that the DTGA classifier achieves better average accuracy than the other three competing learners (neural network, Naïve Bayes and the rough-set based rule inducer) over every kind of data set, irrespective of domain, size, dimensionality and imbalance. Again, the accuracy rate of DTGA is significantly higher than that of pure C4.5 on all 18 data sets. In other words, none of the selected learners improved classification accuracy on any data set as much as DTGA did.
• Secondly, with respect to the results on the more imbalanced data sets (shown in Table 3), we see that DTGA dominates its base classifier C4.5 heavily in almost all cases, and the other classifiers (neural network, Naïve Bayes and the rough-set based rule inducer) in many cases, such as Adult, Ecoli, Heart (Hungarian), Heart (Swiss) and Pima Indian. This reveals that DTGA is a good solution for imbalanced data sets too.
• Thirdly, considering the presence of the default rule in each of the rule sets of C4.5 and DTGA, one may easily realize from the results of Table 3 that the mis-classification rate (%) of DTGA on each data set is considerably reduced in comparison to pure C4.5, since the overall accuracy (%) achieved by DTGA is consistently higher than that of C4.5. This implies that DTGA has a high capability to remove noisy rules and, in particular, it demonstrates the strength of the suggested fitness function.
• Fourthly, as can be seen in Table 3, DTGA achieves a smaller standard deviation on nearly all of the selected problems compared with the other learners, which indicates greater reliability of its predictions.
• Again, analyzing the results of Table 4 for the sensitivity characteristics of the classifiers, Naïve Bayes is the least sensitive, whereas NN is the most sensitive to missing data among the five classifiers. DTGA, however, shows less sensitivity than C4.5, NN and the rule-based classifier (using rough set theory) in almost all cases. More importantly, the sensitivity of DTGA on most of the selected unbalanced data sets, such as Adult, Annealing, Heart (Swiss) and New-thyroid (as shown in Table 2), is much lower than that of the other learners.
• As stated above, rough set theory is a powerful tool for finding hidden patterns in data, for obtaining minimal sets of data (data reduction) and for generating sets of decision rules from data. It even has the
ability to generate rules from minority classes. But the results of Table 3 indicate that the rule-based classifier using rough set theory is also unable to compete with DTGA, although in many cases it shows better performance than pure C4.5, NN and Naïve Bayes.
• Lastly, comparing the performance of the learners on the larger Adult data set, we are hopeful that DTGA, based on the suggested sampling technique, is well suited to operating on volumetric data sets, whereas GA-based systems usually struggle to improve accuracy on larger data sets. Certainly, this is a promising feature in favour of DTGA.

Furthermore, of particular interest, we present below a short comparative study between DTGA and the n2-classifier on the empirical results of five data sets: Ecoli, Glass, Iris, Soybean-large and Yeast. The n2-classifier was first introduced by Jelonek et al. [10]. This kind of classifier is a specialized approach for solving multi-class learning problems; it is composed of (n2 − n)/2 base binary classifiers (where n is the number of decision classes, n > 2). The results on Ecoli, Glass, Iris, Soybean-large and Yeast obtained by the n2-classifier [22] are, respectively, 81.34 ± 1.7, 74.82 ± 1.4, 95.53 ± 1.2, 91.99 ± 0.8 and 55.74 ± 0.9, where each result shows the mean accuracy over 10 folds with its standard deviation. Consulting Table 3 for a direct comparison between DTGA and the n2-classifier on these data sets, we may easily see that the performance of DTGA is better than that of the n2-classifier, although the standard deviations found by the n2-classifier on these data sets are admittedly better than those of DTGA.

5. Conclusion and future work

In particular, the investigations introduced in [17,19] and in the present article are all associated with refining/optimizing rules learned by C4.5 by applying a GA. In short, all of them focus mainly on designing a novel evaluation function, variations on the encoding/decoding strategy and the crossover operation, to improve the overall classification accuracy irrespective of the domain, size and dimensionality of the data sets, in comparison to a variety of classifiers. Besides, data sampling is another important component of these approaches, reducing the bias towards the training set and, to a great extent, overcoming the imbalance problem occurring in data sets. Now, in the context of the present article, we summarize the main points as follows.
• Through experiments, it is observed that our proposed hybrid learning system DTGA shows better classification performance in most cases than the existing learners considered in this study. At the same time, the classification-error rate of DTGA on each data set is lower than that of its base learner C4.5.
• Again, the proposed system is comparatively less sensitive to missing data than C4.5 and NN, and fairly close to Naïve Bayes in this respect.
• To overcome the imbalance problem of a data set, we mainly concentrate on data sampling, in which much attention is paid to maintaining an almost equal proportion of the class distribution in the training set. Such a technique also helps to select an overall informative and balanced sample for classification at the very beginning of learning. In addition, a rule of a specific class replaces only the worst rule of the same class in the rule set, and this characteristic also helps to improve classification performance even if the data set is imbalanced.
• Moreover, the approach easily handles the interpretability problem that occurs in most learning classifier systems applying a GA.
• Finally, DTGA claims lower time complexity than most GA-based approaches, since no encoding and decoding technique is applied here and the best results are usually achieved within 80 generations.

In future, the proposed method can be parallelized to further improve the prediction accuracy of DTGA, especially over volumetric data sets, covering a large search space in order to find better-quality solutions. Further, decimal values of the rules can be encoded to maintain the diversity of the population for any kind of problem through different genetic operators in less time.

Acknowledgement

The authors are grateful to Ambuj Kumar, former student of the Department of Computer Science and Engineering, BIT, Mesra, for the implementation of the genetic algorithm proposed in this study.

Appendix A.

A.1. Data set description

All the datasets used in the experimental evaluation are available from the University of California, Irvine (UCI) Repository of Machine Learning Databases. Each data set (i.e., problem) is briefly described below. For more information on the Repository, consult http://www.ics.uci.edu/∼mlearn/MLRepository.html. However, all the datasets are first discretized by the SPID4.7 discretizer and then used in our experiment. Note: an expression of the form x(y) indicates that there are 'y' entries for class value/code/name 'x'; for example, 4(60) indicates that there are 60 instances of class value 4, i.e., it is used to show the class distribution.

A.1.1. Adult

The Adult data set (originally called the "Census Income" database of the United States Census Bureau) is used to predict, on the basis of census data, whether the income of an individual exceeds $50K/yr. The original database contains 48,842 observations of US citizens, each with 14 non-target attributes and one binary-valued target attribute (called income). In fact, the levels of income are considered as ≤50K (small) and >50K (large). However, due to memory limitations, two thirds of this volumetric data set (i.e., 32,561 out of 48,842 observations) is considered in our experiment. The class distribution of the reduced data set is shown below.
Class distribution: 0(24,720), 1(7841), where 0 represents ≤50K (small) and 1 represents >50K (large); the corresponding distribution of the original data set is 0(37,155), 1(11,687).

A.1.2. Annealing

This database has 798 instances with 38 non-categorical attributes in total and 6 different classes. However, 18 attributes (out of 38) have missing values throughout the database, and so these are deleted by the discretizer adopted in our experiment. Distribution of classes: 1(8), 2(88), 3(608), 4(0), 5(60), U(34).

A.1.3. Credit

This problem concerns credit card applications. It has 15 non-target attributes, one target attribute with 2 class values (+ and −), and 690 observations in total. Class distribution: +(307), −(383).

A.1.4. Dermatology

The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very little difference. The diseases in this group are psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, cronic dermatitis and pityriasis rubra pilaris. Usually a biopsy is necessary for the diagnosis, but unfortunately these diseases share many histopathological features as well. This database contains 366 examples and 34 non-categorical attributes (33 of which are linear-valued and one nominal). Class distribution: psoriasis (112), seboreic dermatitis (61), lichen planus (72), pityriasis rosea (49), cronic dermatitis (52), pityriasis rubra pilaris (20).

A.1.5. Ecoli

The data give characteristics of each ORF (potential gene) in the E. coli genome. Sequence, homology (similarity to other genes), structural information and function (if known) are provided in the database. The data set contains 8 non-target attributes and 336 observations with 8 classes. However, the 5th attribute takes the same value in almost all records of the database, and so it is deleted by our discretizer at the pre-processing phase. Class distribution: cp (cytoplasm) (143), im (inner membrane without signal sequence) (77), pp (perisplasm) (52), imU (inner membrane, uncleavable signal sequence) (35), om (outer membrane) (20), omL (outer membrane lipoprotein) (5), imL (inner membrane lipoprotein) (2), imS (inner membrane, cleavable signal sequence) (2).

A.1.6. Glass

This data set was donated by Vina Spiehler of Diagnostic Products Corporation. The study of classification of types of glass was motivated by criminological investigation: at the scene of a crime, the glass left behind can be used as evidence if it is correctly identified. Originally, the data set has 214 examples, each of which contains 10 non-target attributes and takes one of 7 class values. However, the first attribute is basically an id number (record number) and has no importance in the classification task, so it is simply deleted from our experiment. Again, no example of glass type 4 is found in the database, and so this class is also ignored in our experiment. Type of glass (class attribute): (1) building windows float processed (76), (2) building windows non float processed (70), (3) vehicle windows float processed (17), (4) vehicle windows non float processed (0), (5) containers (13), (6) tableware (9), (7) headlamps (29).
A.1.7. Heart (Hungarian and Swiss)

These data sets were donated by David W. Aha. The database contains information about heart disease collected from regions such as Cleveland, Hungary and Switzerland, although the number of available examples varies from region to region. Interestingly, out of around 76 attributes, most experiments use a subset of 14 relevant attributes (including the class attribute) available in processed form. The "goal" field refers to the presence of heart disease in the patient in five forms; it takes an integer value from 0 (absence of heart disease) to 4. In particular, the Heart (Hungarian) and Heart (Swiss) data sets contain 294 and 123 cases, respectively.

Class distribution | 0 | 1 | 2 | 3 | 4 | Total
Hungarian | 188 | 37 | 26 | 28 | 15 | 294
Switzerland | 8 | 48 | 32 | 30 | 5 | 123
A.1.8. Iris (iris plants)

This data set corresponds to the well-known Iris data set of R.A. Fisher. It contains 150 patterns in four dimensions, measured on iris flowers of 3 different species, where each class refers to a type of iris plant. Class distribution: Iris Setosa (50), Iris Versicolour (50), Iris Virginica (50).

A.1.9. Liver disorder

The liver disorder data set was donated by Richard S. Forsyth, from data collected by BUPA Medical Research Ltd. This problem has 345 instances with 6 non-target attributes. The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each line in the bupa.data file constitutes the record of a single male individual. The selector field is used to split the data into two sets: 1 and 2. Class distribution: 1(145), 2(200).

A.1.10. New-thyroid

This is a medical data set regarding the thyroid gland. It consists of 215 points from 3 classes, 1 (normal), 2 (hyper) and 3 (hypo), in a five-dimensional space. Class distribution: 1(150), 2(35), 3(30).

A.1.11. Pima Indian

The problem is to predict whether a patient would test positive for diabetes given a number of pathological measurements and medical test results. The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 h post-load plasma glucose was at least 200 mg/dl at any survey examination, or if diabetes was found during routine medical care). The patients in the database are females at least twenty-one years old of Pima Indian heritage, living near Phoenix, Arizona, USA. There are 2 classes, 8 numerical attributes and 768 records in the database. Class distribution (class value 1 is interpreted as "tested positive for diabetes", 0 as negative): 0(500), 1(268).

A.1.12. Segment

The problem consists of identifying an outdoor image. The instances were drawn randomly from a database of 7 outdoor images. The images were hand-segmented to create a classification for every pixel as one of brickface, sky, foliage, cement, window, path or grass. It was used in the Statlog project. There are 7 classes, 19 numerical attributes and 2310 records. Classes: brickface, sky, foliage, cement, window, path, grass, with 330 instances per class in the data file.
A.1.13. Soybean

There are 35 non-categorical attributes (some nominal and some ordered) and 19 classes in this problem. Usually, the first 15 classes (out of 19) have been used in prior work; the folklore seems to be that the last four classes are unjustified by the data since they have so few examples. The database has 307 instances. Class distribution: (1) diaporthe-stem-canker (10), (2) charcoal-rot (10), (3) rhizoctonia-root-rot (10), (4) phytophthora-rot (40), (5) brown-stem-rot (20), (6) powdery-mildew (10), (7) downy-mildew (10), (8) brown-spot (40), (9) bacterial-blight (10), (10) bacterial-pustule (10), (11) purple-seed-stain (10), (12) anthracnose (20), (13) phyllosticta-leaf-spot (10), (14) alternaria-leaf-spot (40), (15) frog-eye-leaf-spot (40), (16) diaporthe-pod-&-stem-blight (6), (17) cyst-nematode (6), (18) 2-4-d-injury (1), (19) herbicide-injury (4).
A.1.14. TAE (teaching assistant evaluation) The data set consists of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant (TA) assignments at the Statistics Department of the University of Wisconsin–Madison. The scores were divided into 3 roughly equal-sized categories (“low (1)”, “medium(2)”, and “high(3)”) to form the class variable. Class distribution: 3(52), 2(50), 1(49).
A.1.15. Wine The donor of this set is Stefan Aeberhard. However, the original owner is M. Forina et al., PARVUS an extendible package for Data Exploration, Classification and Correlation, Institute of Pharmaceutical and Food Analysis and Technologies, Brigata Salerno, 16147 Genoa, Italy. These data are, in fact, the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents(non-target attributes) found in each of the 3 types of wines. It contains 178 instances. Class distribution: class-1(59), class-2(71), class-3(48).
A.1.16. Wisconsin

This is the breast cancer (original) database available at UCI. It was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. The data contain various breast biopsy measurements collected by oncologists through fine-needle aspiration. The objective is to predict whether a tissue sample taken from a patient's breast is malignant or benign. There are 2 classes, 10 numerical attributes and 699 observations in the database. However, the first attribute in the data file denotes an id number (simply a sample code), and so it has little relevance to classification; therefore, it is ignored in our present experiment. Class distribution: benign (458), malignant (241).
A.1.17. Yeast

This data set gives statistics on protein localization, with 1484 instances in 10 classes, each instance consisting of 9 non-target attributes. However, the 7th attribute remains almost the same throughout the observations, and so it is ignored in many experiments, including ours. Class distribution: CYT (cytosolic or cytoskeletal) (463), NUC (nuclear) (429), MIT (mitochondrial) (244), ME3 (membrane protein, no N-terminal signal) (163), ME2 (membrane protein, uncleaved signal) (51), ME1 (membrane protein, cleaved signal) (44), EXC (extracellular) (37), VAC (vacuolar) (30), POX (peroxisomal) (20), ERL (endoplasmic reticulum lumen) (5).
Appendix B.

In this section, a tiny classification problem named golf-playing is presented to clarify several points discussed throughout the article. Truly speaking, this is one of the problems on which the Interface [26] software is run. It will help to give a general idea of a classification problem.
B.1. Problem Golf-playing.
B.2. Description

The problem depends mainly on the weather: it describes under what weather conditions the game of golf is suitable for playing. In general, instances in a data set are characterized by the values of features, or attributes, that measure different aspects of the instance. In this case, four non-target attributes, outlook, humidity, temp and windy, are taken into account to forecast whether playing golf is possible on a given day or not. For this purpose, observed values of each of these attributes over a number of days are necessary. A sample data set consisting of 14 different days' observations is given below, along with a description of the possible values of the attributes.
B.3. Non-target attributes with possible values Outlook: sunny, overcast, rain, Humidity: continuous, Temperature: continuous, Windy: true, false.
B.4. Target attribute

Playing-decision: Play, Don't. All these attributes are stored in a file, say golf.name.

B.4.1. Training examples from the original 'golf.data'

Day | Outlook | Humidity | Temp | Windy | Playing-decision
1 | sunny | 85 | 46 | false | Don't
2 | sunny | 88 | 45 | true | Don't
3 | overcast | 82 | 42 | false | Play
4 | rain | 94 | 25 | false | Play
5 | rain | 70 | 20 | false | Play
6 | rain | 65 | 15 | true | Don't
7 | overcast | 66 | 14 | true | Play
8 | sunny | 85 | 28 | false | Don't
9 | sunny | 70 | 15 | false | Play
10 | rain | 72 | 36 | false | Play
11 | sunny | 65 | 25 | true | Play
12 | overcast | 90 | 22 | true | Play
13 | overcast | 75 | 41 | false | Play
14 | rain | 80 | 21 | true | Don't
Note that the attributes of the data set contain both nominal and continuous values. Such attributes can be discretized to reduce the range of values in the case of continuous attributes and to normalize the nominal attributes (e.g., 1 for high, 2 for normal, etc.). Let us assume their values as follows:

Attribute | Assumed discrete values
Playing-decision | Play(1): yes; Don't(0): no
Outlook | sunny(1); overcast(2); rain(3)
Humidity | high(1): ≥ 75; normal(2): < 75
Temperature | hot(1): > 36; mild(2): (20, 36]; cool(3): ≤ 20
Windy | true(1); false(2)
B.4.2. Discretized 'golf.data'

As per the above-assumed ranges and their respective mapped values for the attributes, we get the following discretized data set corresponding to the original data set above.

Day | Outlook | Humidity | Temp | Windy | Playing-decision
1 | 1 | 1 | 1 | 2 | 0
2 | 1 | 1 | 1 | 1 | 0
3 | 2 | 1 | 1 | 2 | 1
4 | 3 | 1 | 2 | 2 | 1
5 | 3 | 2 | 3 | 2 | 1
6 | 3 | 2 | 3 | 1 | 0
7 | 2 | 2 | 3 | 1 | 1
8 | 1 | 1 | 2 | 2 | 0
9 | 1 | 2 | 3 | 2 | 1
10 | 3 | 2 | 2 | 2 | 1
11 | 1 | 2 | 2 | 1 | 1
12 | 2 | 1 | 2 | 1 | 1
13 | 2 | 2 | 1 | 2 | 1
14 | 3 | 1 | 2 | 1 | 0
Note that the role of the discretizer is to provide discretized values of the continuous and nominal attributes of each data set, following its own strategy (i.e., the range of values corresponding to each discrete value of a continuous attribute may vary from discretizer to discretizer). In the present experiment, the selected data sets are discretized by the SPID4.7 discretizer.
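A minimal Python sketch of such a threshold-based mapping is given below. It follows the assumed ranges stated above; the function name is ours, and boundary cases (e.g., humidity exactly 75) depend on whether the cut is taken as ≥ or >, so a few entries may differ slightly from the sample table.

def discretize_golf_record(outlook, humidity, temp, windy):
    # Map one raw golf-playing record to the discrete codes assumed above.
    outlook_code = {'sunny': 1, 'overcast': 2, 'rain': 3}[outlook]
    humidity_code = 1 if humidity >= 75 else 2            # high(1) / normal(2)
    if temp > 36:
        temp_code = 1                                      # hot
    elif temp > 20:
        temp_code = 2                                      # mild
    else:
        temp_code = 3                                      # cool
    windy_code = 1 if windy else 2                         # true(1) / false(2)
    return [outlook_code, humidity_code, temp_code, windy_code]

print(discretize_golf_record('sunny', 85, 46, False))      # day 1 -> [1, 1, 1, 2]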
B.4.3. Rules generated by C4.5

Five primary rules (in IF-THEN structure) induced by C4.5 are shown below. The conclusion of each rule shows the class as well as the information covered (measured in %), in the format: class value [% of training instances covered]. The classifier itself arranges the produced rules in descending order of information covered, grouped by class.

Rule 3: Outlook = 2 → class 1 [70.7%]. [This rule is realized as: IF (Outlook = 2) THEN (Playing-decision = 1).]
Rule 2: Humidity = 2 → class 1 [66.2%]. [This rule is realized as: IF (Humidity = 2) THEN (Playing-decision = 1).]
Rule 5: Outlook = 3 AND Windy = 2 → class 1 [63.0%]. [This rule is realized as: IF (Outlook = 3) AND (Windy = 2) THEN (Playing-decision = 1).]
Rule 1: Outlook = 1 AND Humidity = 1 → class 0 [63.0%]. [This rule is realized as: IF (Outlook = 1) AND (Humidity = 1) THEN (Playing-decision = 0).]
Rule 4: Outlook = 3 AND Windy = 1 → class 0 [50.0%]. [This rule is realized as: IF (Outlook = 3) AND (Windy = 1) THEN (Playing-decision = 0).]

[The target attribute is the playing-decision, and target classes '0' and '1' mean playing-decision 'No' and 'Yes', respectively.]

Obviously, the above rules can be represented as a disjunctive normal form (using the symbol ∨) of conjunctions (using the symbol ∧, i.e., logical AND) of constraints on the attribute values of the instances: Rule-1 ∨ Rule-2 ∨ Rule-3 ∨ Rule-4 ∨ Rule-5 (which is, in fact, a disjunctive normal form), where Rule-1 (one disjunct) is again a collection of conjuncts (i.e., pre-conditions) connected by the symbol ∧, such as (outlook = 1) ∧ (humidity = 1) → class 0, and so on.
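The conversion into the fixed-length, '*'-padded representation shown in Section B.4.4 below can be sketched as follows in Python; this is an illustration under our own naming, not the actual implementation of the Interface [26].

# Attribute order assumed for the golf problem: outlook, humidity, temp, windy.
ATTRIBUTES = ['outlook', 'humidity', 'temp', 'windy']

def to_fixed_length(conditions, class_value):
    # Convert one C4.5 rule, given as {attribute: value} conditions plus a
    # class, into the '*'-padded row used by the GA phase (cf. B.4.4 below).
    row = [conditions.get(name, '*') for name in ATTRIBUTES]
    return row + [class_value]

# Rule 3: IF (Outlook = 2) THEN (Playing-decision = 1)
print(to_fixed_length({'outlook': 2}, 1))                   # [2, '*', '*', '*', 1]
# Rule 1: IF (Outlook = 1) AND (Humidity = 1) THEN (Playing-decision = 0)
print(to_fixed_length({'outlook': 1, 'humidity': 1}, 0))    # [1, 1, '*', '*', 0]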
B.4.4. Formation of the discretized rule set by the Interface

The Interface forms a discretized rule set corresponding to the above rule set, eliminating the IF-THEN clause, as shown below:

Rule | Outlook | Humidity | Temp | Windy | Playing-decision
Rule 3 | 2 | * | * | * | 1
Rule 2 | * | 2 | * | * | 1
Rule 5 | 3 | * | * | 2 | 1
Rule 1 | 1 | 1 | * | * | 0
Rule 4 | 3 | * | * | 1 | 0
The symbol '*' in a rule denotes the don't-care symbol and means that the attribute corresponding to '*' has no importance in that rule. For example, in Rule 3, only one condition, involving the outlook attribute, is present; therefore, the conditions corresponding to the attributes humidity, temp and windy (following their positions in golf.data) are marked '*'. Further, the length of each of these rules is considered to be 5.
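Applying such wildcard rules to a discretized example can be sketched as follows in Python. The ordered rule list is taken from Section B.4.4 above, and the sequential matching with a default rule follows the strategy described in Appendix C; the choice of the majority class (1) as the default is an assumption of this sketch.

# Ordered rule list from B.4.4: (conditions, class), '*' = don't care.
# Attribute order: outlook, humidity, temp, windy.
GOLF_RULES = [
    ([2, '*', '*', '*'], 1),   # Rule 3
    (['*', 2, '*', '*'], 1),   # Rule 2
    ([3, '*', '*', 2], 1),     # Rule 5
    ([1, 1, '*', '*'], 0),     # Rule 1
    ([3, '*', '*', 1], 0),     # Rule 4
]

def classify(example, rules, default_class):
    # Sequential matching: try the rules in order; if none fires,
    # the default rule (majority class) decides.
    for conditions, label in rules:
        if all(c == '*' or c == v for c, v in zip(conditions, example)):
            return label
    return default_class

# Day 1 of the discretized golf data: outlook=1, humidity=1, temp=1, windy=2
print(classify([1, 1, 1, 2], GOLF_RULES, default_class=1))   # -> 0 (Don't play)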
Appendix C.

C.1. An illustration on the replacement of rules

Consider the Heart (Hungarian) data set. For this problem, most experiments use a subset of 14 attributes (13 non-categorical and 1 categorical). The categorical (goal) field refers to the presence of heart disease in the patient in 5 forms; these are integer values from 0 (absence of heart disease) to 4. In this experiment too, the 13 non-categorical attributes (denoted A1, A2, ..., A13) and the class attribute (denoted C) are chosen. The original rule set (learned by C4.5) and the replaced rule set (i.e., refined by the GA part of DTGA) corresponding to this problem are shown below in tabular form. Obviously, the rules induced by C4.5 are first passed to the Interface to eliminate the IF-THEN form (as discussed in Appendix B). Let us remind here again that an attribute condition consisting of '*' has no importance in the rule. The rules marked "(replaced)" in the second table are the rules that were replaced, relative to the original rule set, at a specific run; this is shown to give an idea of the replaced rules. The replaced set is, in fact, captured at a specific run (consisting of 80 generations) of DTGA, whereas the original rule set is the rule set of its base learner C4.5 at that particular run.
Original rule set (learned by C4.5)

Rule | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | C
1 | * | * | * | * | 2 | * | * | * | 2 | 5 | * | * | * | 3
2 | 1 | 1 | * | 1 | * | * | * | * | * | * | * | * | * | 1
3 | * | 2 | * | * | * | * | * | * | * | * | * | * | 2 | 1
4 | * | * | * | * | * | * | * | * | 2 | 6 | * | * | * | 4
5 | * | * | * | * | 1 | * | * | * | 2 | 5 | * | * | * | 2
6 | 6 | * | * | * | * | * | * | * | 2 | * | * | * | * | 2
7 | * | 2 | * | * | * | * | 3 | * | * | * | * | * | * | 1
8 | * | * | * | * | * | * | * | * | 1 | * | * | * | 3 | 0
9 (default) | * | * | * | * | * | * | * | * | * | * | * | * | * | 0

Rule set generated and replaced by GA

Rule | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | C
1 | * | * | * | * | 2 | * | * | * | 2 | 5 | * | * | * | 3
2 (replaced) | * | 2 | * | * | 1 | * | * | * | * | * | * | * | * | 1
3 (replaced) | * | * | * | * | * | * | * | * | * | * | * | * | 2 | 1
4 | * | * | * | * | * | * | * | * | 2 | 6 | * | * | * | 4
5 (replaced) | * | * | * | * | * | * | * | * | 1 | 5 | * | * | * | 2
6 | 6 | * | * | * | * | * | * | * | 2 | * | * | * | * | 2
7 | * | 2 | * | * | * | * | 3 | * | * | * | * | * | * | 1
8 | * | * | * | * | * | * | * | * | 1 | * | * | * | 3 | 0
9 (default) | * | * | * | * | * | * | * | * | * | * | * | * | * | 0
Note that the last rule in each of the above rule sets is the default rule (i.e., a rule with the majority class). It is originally generated by C4.5 for each data set and placed at the end of the rule set. It participates in measuring the overall accuracy of the rule set, but none of the operations adopted in our investigation (selection, crossover, replacement, etc.) is performed on it; in other words, it remains unchanged throughout our experiment. Let us remind here that matching of an input for computing accuracy is performed in a sequential manner, i.e., for a given input we first check rule 1; if it matches, fine, otherwise we go to rule 2, and so on. When all the rules fail, the default rule comes into action. This strategy is followed for each rule set (generated by C4.5 and by the hybrid system) corresponding to each data set. Clearly, owing to the presence of the default rule in each rule set, the error rate (%) of such a rule set on a data set equals the difference between 100 and the obtained accuracy (%) on that data set. New rules with new attribute values are generated by crossover, and they appear in the replaced rule set because they replace the worst rules in the existing rule set.
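The replacement policy just described can be sketched as follows in Python; it is only an illustration under our own naming. The accuracy-based nature of the fitness score is an assumption here (DTGA's actual fitness function is defined in the body of the paper), but the constraints shown, same-class replacement of the worst rule only and an untouched default rule, are those stated in the text.

def replace_worst_same_class(rule_set, candidate, fitness, default_index):
    # A candidate rule may only displace the worst existing rule of the SAME
    # class, and the default rule is never touched.  'fitness' is assumed to
    # be an accuracy-based score on the training set.
    same_class = [i for i, (conds, label) in enumerate(rule_set)
                  if label == candidate[1] and i != default_index]
    if not same_class:
        return rule_set                        # no rule of this class to replace
    worst = min(same_class, key=lambda i: fitness(rule_set[i]))
    if fitness(candidate) > fitness(rule_set[worst]):
        rule_set = list(rule_set)
        rule_set[worst] = candidate            # e.g. rules 2, 3 and 5 above
    return rule_set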
Appendix D.

Let us first analyze the proposed sampling technique, choosing different percentages of training data on the Heart (Hungarian) data set. The class distribution of the data set is as follows:

Class value | 0 | 1 | 2 | 3 | 4 | Total
Number of examples | 188 | 37 | 26 | 28 | 15 | 294

Case-I: 30% training data based on equal proportion in the class distribution. The number of examples in the training set in this case is as follows:

Class value | 0 | 1 | 2 | 3 | 4 | Total
Number of examples | 57 | 12 | 08 | 09 | 05 | 91
Ratio in the training set | 0.619 | 0.142 | 0.087 | 0.098 | 0.054 |

Case-II: 40% training data based on equal proportion in the class distribution. The number of examples in the training set in this case is as follows:

Class value | 0 | 1 | 2 | 3 | 4 | Total
Number of examples | 76 | 15 | 11 | 12 | 06 | 120
Ratio in the training set | 0.633 | 0.125 | 0.091 | 0.100 | 0.050 |

Case-III: 50% training data based on equal proportion in the class distribution. The number of examples in the training set in this case is as follows:

Class value | 0 | 1 | 2 | 3 | 4 | Total
Number of examples | 94 | 19 | 13 | 15 | 08 | 149
Ratio in the training set | 0.630 | 0.1275 | 0.087 | 0.100 | 0.0536 |

Case-IV: 60% training data based on equal proportion in the class distribution. The number of examples in the training set in this case is as follows:

Class value | 0 | 1 | 2 | 3 | 4 | Total
Number of examples | 113 | 23 | 16 | 17 | 09 | 178
Ratio in the training set | 0.634 | 0.129 | 0.089 | 0.095 | 0.050 |
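The per-class training counts in Cases I-IV above can be reproduced with a ceiling-based stratified draw, as the note that follows explains. A minimal Python sketch (function name ours) is given below; it reproduces the Case-I total of 91, and Cases II and IV follow similarly (Case-III differs by one example for class 3).

import math

heart_hungarian = {0: 188, 1: 37, 2: 26, 3: 28, 4: 15}   # distribution above

def per_class_training_counts(class_counts, train_fraction):
    # Ceiling operator applied to every class, as stated in the note below.
    return {c: math.ceil(train_fraction * n) for c, n in class_counts.items()}

counts = per_class_training_counts(heart_hungarian, 0.3)   # Case-I
print(counts, sum(counts.values()))
# {0: 57, 1: 12, 2: 8, 3: 9, 4: 5} 91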
Note that the ceiling operator is used here when including the examples of each class type. The above analysis shows that, as the percentage of training data increases, the ratios of the minority classes (especially in the case of an imbalanced data set) decrease while those of the majority classes increase.

The accuracies (%) obtained by C4.5 on the respective test sets at a run under the four cases, for three of the data sets, are given below; the value within parentheses indicates the number of rules generated by C4.5 over the respective training set.

Problem name | Equi-distributed (30-70%) sampling (Case-I) | Equi-distributed (40-60%) sampling (Case-II) | Equi-distributed (50-50%) sampling (Case-III) | Equi-distributed (60-40%) sampling (Case-IV)
Ecoli | 82.60 (17) | 78.80 (22) | 81.80 (25) | 82.95 (29)
Heart (Hungarian) | 68.40 (09) | 70.50 (12) | 72.40 (14) | 73.20 (17)
Heart (Swiss) | 43.40 (09) | 45.50 (12) | 38.40 (14) | 41.20 (17)

The numbers of rules generated by C4.5 from different training sets of the Heart (Swiss) data set at a particular run are recorded in the table given below, together with the accuracy results of the rule sets on their respective test sets. The rule sets in tabular form (after applying the Interface s/w) are likewise presented below.

Heart (Swiss) data set:
Property | Random 30% training data | Equi-distributed 30% training data | Random 66% training data
Number of C4.5 rules on the training data sets | 10 | 10 | 12
Accuracies of C4.5 rule sets on the respective test sets | 36.78% | 45.98% | 42.86%
Rule set generated by C4.5 over random 30% training data

Rule | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | C
1 | * | * | * | 3 | * | * | 3 | * | * | * | * | * | 2
2 | * | * | * | * | * | * | * | * | 4 | * | * | * | 2
3 | * | * | 2 | * | * | * | * | * | * | * | * | * | 0
4 | * | * | 4 | 2 | * | * | * | * | * | * | * | * | 1
5 | * | * | * | * | * | * | * | * | 3 | * | * | * | 1
6 | * | * | * | 5 | * | * | * | * | 2 | * | * | * | 1
7 | * | * | * | 3 | * | * | 1 | * | * | * | * | * | 3
8 | * | * | * | * | * | * | * | * | 5 | * | * | * | 3
9 | * | * | 4 | 4 | * | * | * | * | 2 | * | * | * | 3
10 | * | * | * | * | * | * | * | * | * | * | * | * | 1
Rule set generated by C4.5 over equally distributed 30% training data

Rule | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | C
1 | * | * | * | 2 | * | * | 3 | * | 2 | * | * | * | 1
2 | 3 | * | * | 4 | * | * | * | * | * | * | * | * | 2
3 | 2 | * | 4 | * | * | * | * | * | * | * | * | * | 2
4 | 3 | * | * | 5 | * | * | * | * | * | * | * | * | 2
5 | * | * | 3 | * | * | * | * | * | * | * | * | * | 0
6 | * | * | * | * | * | * | * | * | 5 | * | * | * | 3
7 | * | * | * | 2 | * | * | 2 | * | * | * | * | * | 3
8 | * | * | * | 3 | * | * | * | * | * | * | * | * | 3
9 | * | * | 4 | * | * | * | * | * | * | * | 3 | * | 4
10 | * | * | * | * | * | * | * | * | * | * | * | * | 1
Rule set generated by C4.5 over random 66% training data

Rule | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | C
1 | * | * | * | * | * | * | * | * | 1 | 2 | * | * | 1
2 | * | * | * | 2 | * | * | * | * | 2 | * | * | 3 | 1
3 | * | * | 4 | 5 | * | * | * | * | 2 | * | * | * | 1
4 | * | * | * | * | * | * | * | * | * | 3 | * | * | 1
5 | * | * | * | * | * | * | * | * | 3 | * | * | * | 1
6 | * | * | 3 | * | * | * | * | * | 5 | * | * | * | 1
7 | * | * | 3 | * | * | * | 2 | * | * | * | * | * | 1
8 | * | * | 4 | * | * | * | * | * | 5 | * | * | * | 3
9 | * | * | * | 3 | * | * | 1 | * | * | * | * | * | 3
10 | * | * | * | * | * | * | * | * | * | * | * | 2 | 3
11 | * | * | * | 4 | * | * | * | * | * | * | * | * | 2
12 | * | * | * | * | * | * | * | * | * | * | * | * | 2
Note that the default rule here has class value 2 instead of 1, since in this training set the number of instances with class value 2 dominates the number of instances with class value 1.
References [1] C. Blake, E. Koegh, C.J. Mertz, Repository of Machine Learning, University of California at Irvine, 1999. [2] P.K. Chan, S.J. Stolfo, A comparative evaluation of voting and meta-learning on partitioned data, in: Proceedings of the 12th International Conference on Machine Learning, San Francisco, 1995, pp. 90–98. [3] K.J. Cherkauer, Human Expert Level Performance on a Scientific Image Analysis Task by a System Using Combined Artificial Neural Networks, Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithms (1996) 15–21. [4] T.G. Dietterich, Ensemble methods in machine learning, in: Proceedings of the 1st International Workshop on Multiple Classifier Systems, LNCS vol. 1857, Springer Verlag, 2000, pp. 1–15. [5] U.M. Fayyad, G. Piatetsky Shapiro, P. Smyth, R. Uthurusamy, From data mining to knowledge discovery, Advances in Knowledge Discovery and Data Mining (1996) 1–36. [6] J. Gama, Combining Classification Algorithms, Ph.D. Thesis, University of Porto, 1999. [7] J.W. Grzymala-Busse, LERS—a system for learning from examples based on rough sets intelligent decision support, in: Handbook of Applications and Advances of the Rough Sets Theory, Kluwer, 1992, pp. 3–18. [8] T. Hastie, R. Tibishirani, Classification by Pair Wise Coupling Advances in Neural Information Processing Systems, NIPS97, 10, MIT Press, 1998, pp. 507–513. [9] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, 2004. [10] J. Jelonek, J. Stefanowski, Experiments on solving multi-class learning problems by the n2 classifier, in: Proceedings of 10th European Conference on Machine Learning ECML 98, LNAI vol. 1398, Springer Verlag, 1998, pp. 172–177. [11] W.J.M. Klosgen, Handbook of Data Mining and Knowledge Discovery, Oxford Press, 2002. [12] C. Merz, Combining classifiers using correspondence analysis, Advances in Neural Information Processing Systems 10 (1998) 33–58. [13] R.S. Michalski, I. Bratko, in: M. Kubat (Ed.), Machine Learning and Data Mining, John Wiley & Sons, 1998. [14] W. Jerzy Grzymala-Busse, S. Jerzy, W. Szymon, A comparison of two approaches to data mining from imbalanced data, Journal of Intelligent Manufacturing 16 (2005) 565–573. [15] T.M. Mitchell, Machine Learning, McGraw-Hill, 1997. [16] S. Nowaczyk, J. Stefanowski, Experimental Evaluation of Classifiers Based on Combiner Strategy, Institute of Computing Sciences, Poznan University of Technology, Research Report RA001/02, March 2002. [17] B.K. Sarkar, S.S. Sana, K.S. Choudhury, Accuracy Based, Learning classification system, International Journal of Information and Decision Sciences 2 (1) (2010) 68–85. [18] J.R. Quinlan, C4.5. Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1995. [19] B.K. Sarkar, S.S. Sana, A hybrid approach to design efficient learning classifiers, Journal, Computers and Mathematics with Applications 58 (2009) 65–73. [20] J. Stefanowski, Algorithms of Rule Induction for Knowledge Discovery, Habilitation Thesis Published as Series Rozprawy no. 361, Poznan Univeristy of Technology Press, Poznan, 2001. [21] J. Stefanowski, The bagging and n2 classifiers based on rules induced by MODLE, in: Proceedings of the 4th International Conference Rough Sets and Current Trends in Computing, RSCTC’2004, LNAI vol. 3066, Uppsala, Sweden, Springer Verlag, 2004 June 2004, pp. 488–497. [22] J. Stefanowski, On Combined Classifiers, Rule Induction and Rough Sets, Transactions on Rough Sets, 6, LNAI, vol. 4374, Journal Subline, Springer Verlag, 2007, pp. 329–350. [23] G. 
Valentini, F. Masuli, Ensambles of Learning Machines, Neural Nets WIRN Vietri, 2486, Springer-Verlag, LNCS, 2002, pp. 3–19. [24] WEKA 3.4.6, Data Mining Software in Java, http://www.cs.waikato. ac.nz/ml/weka. [25] S. Pal, H. Biswas, SPID4.7: discretization using successive pseudo deletion at maximum information gain boundary points, in: Proceedings of the Fifth SIAM International Conference on Data Mining, Newport California SIAM, 2005, pp. 546–550. [26] B.K. Sarkar, K. Sachdev, S. Bharati, A. Bhaskar, An interface for converting rules generated by C4.5 to the most suitable format for genetic algorithm, in: Proceedings of the Eighth International Conference on IT (CIT-2005), 20–23 December, Bhubaneswar, India, 2005, pp. 113–115.
[27] E. Bernado-Mansilla, J.M. Garella-Guiu, Accuracy-based learning classifier systems: models analysis and applications to classification tasks, Evolutionary Computation 11 (3) (2003) 209–238. [28] R. Sikora, Learning control strategies for chemical process: a distributed approach, IEEE Export (1992) 35–43. [29] R. Sikora, M. Shaw, A doubled-layered learning approach to acquiring rules for classification: integrating genetic algorithms with similarity-based learning ORSA, Journal on Computing 6 (1996) 174–187. [30] R. Sikora, S. Piramuthu, An intelligent fault diagnosis system for robotic machines, International Journal of Computational Intelligence and Organizations 1 (1996) 144–153. [31] I. Lee, R. Sikora, M. Shaw, A genetic algorithm based approach to flexible flowline scheduling with variable lot sizes, IEEE Transactions on Systems, Man, and Cybernetics 27B (1995) 36–54. [32] J. Koza, Genetic Programming On the programming of Computers by Means of Natural Selection, MIT Press, Cambridge, London, 1992. [33] P.C. Chang, C.H. Liu, Y.W. Wang, A hybrid model by clustering and evolving fuzzy rules for sales decision supports in printed circuit board industry, Decision Support Systems 42 (3) (2006) 1254–1269. [34] C.W.M. Yuen, W.K. Wong, S.Q. Qian, L.K. Chan, E.H.K. Fung, A hybrid model using genetic algorithm and neural network for classifying garment defects, Expert Systems with Applications (2008), doi:10.1016/j.eswa.2007.12.009. [35] K.M. Faraoun, A. Boukleif, Genetic programming approach for multi-category pattern classification applied to network intrusions detections, International Journal of Computational Intelligence 3 (2006) 79–90. [36] W.K. Wong, X.H. Zeng, W.M.R. Au, A decision support tool for apparel coordination through integrating the knowledge-based attribute evaluation expert system and the T–S fuzzy neural network, Expert Systems with Applications, Published Online doi:10.1016/j.eswa.2007.12.068. [37] D.E. Goldberg, Genetic Algorithms in Search Optimization and Machine Learning, New York, Addison Wesley, 1989. [38] J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan, Ann Arbor, 1975.
[39] S.W. Wilson, Classifier fitness based on accuracy, Evolutionary Computation 3 (2) (1995) 149–175. [40] N. Japkowicz, S. Stephen, The class imbalance problem: significance and strategies, in: International Conference on Artificial Intelligence, vol. 1, 2000, pp. 111–117. [41] N. Japkowicz, S. Stephen, The class imbalance problem: a systematic study, Intelligent Data Analysis 6 (5) (2002) 429–450. [42] A. Tanwani, M. Farooq, The role of biomedical dataset in classification, in: Proceedings of AMIE: 12th international Conference on Artificial Intelligence, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 370–374. [43] L. Lei, N. Wu, P. Liu, Applying sensitivity analysis to missing data in classifiers, in: Proceedings of ICSSM’ 05, 2005, pp. 1051–1056, ISBN: 0-7803-8971-9 (IEEEXplore). [44] RSES (Rough Set Exploration System) 2.2: http://logic.mimuw.edu.pl/∼rses, 2005. [45] P. Clark, T. Niblett, The CN2 induction algorithm, Machine Learning 3 (4) (1989) 261–283. [46] B. Boutsinas, G. Antzoulatos, P. Alevizos, A novel classification algorithm based on clustering, in: In the First International Conference From Scientific Computing to Computational Engineering, Athens, Greece, 2004. [47] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11 (1982) 341–356. [48] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht, 1982. [49] L. Hyafil, R. Rivest, Constructing optimal binary decision trees is NP-complete, Information Processing Letters 5 (1) (1976). [50] A. Orriols-Puig, E. Bernadó-Mansilla, Evolutionary rule-based systems for imbalanced data sets, Soft Computing 13 (2009) 213–225. [51] J. Ryan Urbanowicz, J.H. Moore, Learning classifier systems: a complete introduction, review, and roadmap, Journal of Artificial Evolution and Applications (2009), doi:10.1155/2009/736398 (Article ID 736398).