A bi-phased multi-objective genetic algorithm based classifier

Dipankar Dutta, Jaya Sil, Paramartha Dutta

PII: S0957-4174(19)30881-4
DOI: https://doi.org/10.1016/j.eswa.2019.113163
Reference: ESWA 113163
To appear in: Expert Systems With Applications
Received date: 31 August 2019
Revised date: 21 December 2019
Accepted date: 22 December 2019

Please cite this article as: Dipankar Dutta, Jaya Sil, Paramartha Dutta, A bi-phased multi-objective genetic algorithm based classifier, Expert Systems With Applications (2019), doi: https://doi.org/10.1016/j.eswa.2019.113163

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.

Highlights

• We have proposed a Bi-Phased Multi-Objective Genetic Algorithm (BPMOGA)
• We have performed data classification with the proposed algorithm
• It is a cyclic algorithm
• Statistical tests show that the performance of the proposed algorithm is either superior or comparable to other classifiers


A bi-phased multi-objective genetic algorithm based classifier

Dipankar Dutta a,*, Jaya Sil b, Paramartha Dutta c

a University Institute of Technology, The University of Burdwan, Burdwan, West Bengal, India, PIN 713104. Email address: [email protected]
b Indian Institute of Engineering Science and Technology, Shibpur, Howrah, West Bengal, India, PIN 711103. Email address: [email protected]
c Visva-Bharati University, Shantiniketan, Birbhum, West Bengal, India, PIN 731235. Email address: [email protected]

* Corresponding author. Email address: [email protected], Mobile No: +91 9832115594 (Dipankar Dutta)

Abstract

This paper presents a novel Bi-Phased Multi-Objective Genetic Algorithm (BPMOGA) based classification method. It is a Learning Classifier System (LCS) designed for supervised learning tasks. Here we use Genetic Algorithms (GAs) to discover optimal classifiers from data sets. The objective of the work is to find a classifier, or Complete Rule (CR), which comprises several Class Specific Rules (CSRs). Phase-I of BPMOGA extracts optimized CSRs in IF-THEN form by following the Michigan approach, without considering interaction among the rules. Phase-II of BPMOGA builds optimized CRs from CSRs by following the Pittsburgh approach, combining the advantages of both approaches. Extracted CRs help to build CSRs for the next run of phase-I. Hence phase-I and phase-II are cyclically related, which is one of the unique aspects of BPMOGA. With the help of twenty-one benchmark data sets from the University of California at Irvine (UCI) machine learning repository, we compare the performance of the BPMOGA based classifier with fourteen GA and non-GA based classifiers. Statistical tests show that the performance of the proposed classifier is either superior or comparable to other classifiers.

Keywords: Classification rules mining, Elitist Multi-Objective Genetic Algorithm, Pareto approach, Statistical test

1. Introduction

A list of abbreviations used in this paper is given below.

List of Abbreviations

ADI: Adaptive Discretization Intervals
ARri max: Maximum value of the ith feature gene in the rth chromosome
ARri min: Minimum value of the ith feature gene in the rth chromosome
BioHEL: Bioinformatics-oriented Hierarchical Evolutionary Learning
BPMOGA: Bi-Phased Multi-Objective Genetic Algorithm
Cnf: Confidence
CNN: Center Nearest Neighbor
CORE: CO-evolutionary Rule Extractor
Cov: Coverage
CR: Complete Rule
CSR: Class Specific Rule
EFS-RPS: Evolutionary Feature Selection for fuzzy Rough set based Prototype Selection
GA: Genetic Algorithm
HIDER: HIerarchical DEcision Rules
ILGA: Incremental Learning with GAs
IP: Initial Population
KEEL: Knowledge Extraction Evolutionary Learning
LCS: Learning Classifier System
max(Ai): Largest value of the ith feature of the training data set
min(Ai): Smallest value of the ith feature of the training data set
MLP-BP: Multilayer Perceptron with Backpropagation
MOGA: Multi-Objective GA
MOOP: Multi-Objective Optimization Problem
NB: Naïve Bayes
NCSR: Number of CSRs
NOC: Number of Cover
NOM: Number of Match
NSGA-II: Non-dominated Sorting GA-II
NSLV: New Structural Learning algorithm in Vague environment
NVF: Number of Valid Features
SUP(Ant ∧ Cons): Support of the antecedent and consequence
SUP(Ant): Support of the antecedent
SUP(Cons): Support of the consequence
SWOT: Strengths, Weaknesses, Opportunities, and Threats
TCnf: Total Confidence
TCov: Total Coverage
UCI: University of California at Irvine
UCS: sUpervised Classifier System
10-CV: 10-fold Cross Validation

Designing classifiers is one of the mature fields of data mining. A common approach in data mining is to extract rules using evolutionary algorithms and then build classifiers from those rules. However, the existing classifier-building algorithms have some shortcomings, listed below.


1. Some of them use local optimization algorithms.
2. Some explore global search algorithms, but lack local search ability.
3. Many GA based classifiers use either binary or real encoding.
4. Many consider classification rule mining as a single-objective optimization problem.
5. Many extract either Class Specific Rules (CSRs) or Complete Rules (CRs).
6. Some of them do not have any mechanism to control the number of CSRs in CRs.
7. In most cases, the optimization of CSRs and CRs is not cyclically related.
8. Many require separate runs of the rule extraction algorithm to extract CSRs for different classes.
9. In many algorithms, users have to specify various threshold values to eliminate unnecessary CSRs.
10. Many classifiers do not deal with missing feature values.
11. Many classifiers deal with only one type of feature (continuous or categorical).

The objective of this work is to build a classifier that overcomes these drawbacks. Most GA based methods for building classifiers use a single-phased GA such as Non-dominated Sorting GA-II (NSGA-II) (Deb et al., 2002). Our classification rule extraction process is bi-phased, and existing single-phased GAs are not suitable for a bi-phased process, so we have built a novel BPMOGA. Here we use it for classification rule extraction, although it can be used for other purposes.

Rule induction methods in data mining for pattern discovery are mostly based on local heuristic search techniques (Chiu and Hsu, 2005), but local search methods are often trapped in local optima and are sensitive to the choice of the initial solution. GAs are population based global search algorithms, can escape local optima, and are hence robust; they are successful in discovering optimal classification rules from data sets. Apart from that, GAs handle feature interaction better than greedy rule induction algorithms (Freitas, 2003). So we have used a GA here. However, a GA cannot guarantee a global optimum, and one of its major drawbacks is its computation cost, which grows with the size of the search space. For example, to compute each objective value of a CSR, every record of the training data set has to be tested at least once. The computation cost of combined algorithms (local search with GA) is higher than that of GA based algorithms, so in this work the GA is not combined with any local search method. By preprocessing the data and by a special chromosome encoding and genetic operator (mutation in phase-I of BPMOGA), we perform local search in a restricted manner.

Phase-I of BPMOGA uses real coded genes for the continuous features of data sets. Real coding is more appropriate for real domains, simply because it is


more natural to that domain (Aguilar-Ruiz et al., 2003). For categorical features it uses binary coded genes; binary coding is appropriate for a binary domain. Thus, we take the advantages of both types of chromosome encoding. Fuzzy classification rules are more understandable for humans and can deal with uncertainty, but the complexity of fuzzy classification rule extraction is higher than that of crisp classification rule extraction. Therefore, in this work, we use a combination of binary and real coded GA to extract crisp classification rules. The algorithm can easily be modified to extract fuzzy classification rules.

The Michigan and the Pittsburgh approaches are the two approaches to GA based classification rule extraction. K. De Jong has written a brief general description (De Jong, 1988) of these two approaches: "To anyone who has read Holland (Holland, 1992), a natural way to proceed is to represent an entire rule set as a string (an individual), maintain a population of candidate rule sets, and use selection and genetic operators to produce new generations of rule sets. Historically, this was the approach taken by De Jong and his students while at the University of Pittsburgh (Smith, 1980), which gave rise to the phrase the Pitt approach. However, during the same time period, Holland developed a model of cognition (classifier systems) in which the members of the population are individual rules and a rule set is represented by the entire population (Booker, 1982; Holland and Reitman, 1978). This quickly became known as the Michigan approach and initiated a friendly, but provocative series of discussions concerning the strengths and weaknesses of the two approaches."

The two approaches have their own utilities. The Michigan approach (Greene and Smith, 1993; Holland et al., 1989; Noda et al., 1999) (one-rule-per-individual encoding) is known as a partial classification method or nugget discovery process; it seeks to find patterns that represent "strong" descriptions of a specified class, even when only a small number of representatives of that class are present in the data set (de la Iglesia et al., 2006). For example, it can identify the cause of rare diseases. Here, a chromosome represents an IF-THEN rule. Each represents a Class Specific Rule (CSR), as the THEN part is the class label only. Although the chromosome encoding is simpler and shorter, the fitness of CSRs is not always the best indicator of the quality of the discovered rule set. On the other hand, the Pittsburgh approach (several-rules-per-individual encoding) (Bandyopadhyay et al., 2000; Corcoran and Sen, 1994) evolves classifiers which are combinations of a set of rules (CSRs). Interactions among those rules take place during the evolution of the GA, which helps to extract classifiers of good quality. Chromosomes encode a set of rules and are usually of variable length, hence complex. Besides this, because the Pittsburgh approach learns iteratively from sets of training data, it can work off-line only, whereas the Michigan approach is designed to work online but can solve off-line problems as well.

In BPMOGA, we combine the advantages of the two approaches. BPMOGA extracts CRs, but it does not extract them directly from data sets; instead, it builds CRs from CSRs. In most cases, the number of CSRs is less than the number of data points of the data sets, so the search spaces


become smaller. Extracted CRs guide the fine tuning of CSRs. Some rule learning algorithms take the best CSR of any class after each run of the GA, so to build a classifier the minimum number of required GA runs is equal to the number of classes of the data set. These types of algorithms are classified as iterative rule learning algorithms (Venturini, 1993; Aguilar-Ruiz et al., 2007). One of the advantages of the BPMOGA based classifier is that, unlike iterative rule learning, it can extract CSRs for the different classes simultaneously in a single run.

Initially, GA was capable of optimizing one objective and hence was known as single-objective GA, and researchers used it for classification rule mining by treating the task as a single-objective optimization problem (Fidelis et al., 2000; Gündoğan et al., 2004; Bandyopadhyay et al., 2000; Corcoran and Sen, 1994). However, classification rule mining is a Multi-Objective Optimization Problem (MOOP), as any classification rule must be accurate as well as general. In this work, we consider classification rule mining as a MOOP and use a Multi-Objective GA (MOGA) to solve it. David Schaffer proposed the vector evaluated GA (Schaffer, 1984, 1985), which was the first step towards framing MOGA. Three different approaches to deal with MOOPs are:

1. Weighted sum approach (Jutler, 1967)
2. The lexicographical approach (Fourman, 1985)
3. The Pareto approach (Pareto, 1972)


In data mining work, the weighted sum approach is most commonly used. It is predominantly an ad-hoc approach to solving MOOPs, whereas the lexicographical approach and the Pareto approach are more systematic (Freitas, 2004). Here, we use the Pareto approach.

When all features of the data set take part in classification rule formation, the chromosomes representing the antecedent part of the IF <Antecedent> THEN <Consequent> rules are all of the same length. When only selected features take part in classification rule formation, the lengths of the chromosomes are not the same. To make all chromosomes the same length, most GAs use a "do not care" condition in place of values for features that do not take part in chromosome formation. In order to deal with variable length chromosomes, special genetic operators are required. We take the advantages of both fixed and variable length chromosomes. Phase-I of BPMOGA uses the "do not care" condition in place of values for features that do not take part in chromosome formation; thus, the length of chromosomes becomes fixed. For any run of phase-II, the length of chromosomes is fixed, but it varies between different runs of phase-II.

The tasks of generating optimized CSRs and CRs are MOOPs, so we use MOGAs in both phases of BPMOGA. Thus, we neither consider classification rule mining as a single-objective optimization problem nor convert multiple objectives into one objective. BPMOGA performs Multi-Objective optimization in its two phases. In phase-I, CSRs are optimized by three measures:

1. Confidence,
2. Coverage, and
3. Number of valid features in a CSR.

In phase-II, the optimization measures of a CR are:

1. Total Confidence,
2. Total Coverage, and
3. Number of CSRs in a CR.
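To make the optimization direction concrete, the following sketch (our illustration, not code from the paper; the function and variable names are ours) compares two candidate solutions under the three phase-I measures, treating confidence and coverage as maximization objectives and the number of valid features as a minimization objective; phase-II applies the same test to (TCnf, TCov, NCSR).

    # Pareto-dominance test for phase-I objective triples (Cnf, Cov, NVF):
    # Cnf and Cov are maximized, NVF is minimized. Phase-II uses
    # (TCnf, TCov, NCSR) in exactly the same way.
    def dominates(a, b):
        """Return True if solution a Pareto-dominates solution b."""
        cnf_a, cov_a, nvf_a = a
        cnf_b, cov_b, nvf_b = b
        no_worse = cnf_a >= cnf_b and cov_a >= cov_b and nvf_a <= nvf_b
        strictly_better = cnf_a > cnf_b or cov_a > cov_b or nvf_a < nvf_b
        return no_worse and strictly_better

    # Example: a more confident and simpler rule dominates a weaker one.
    print(dominates((0.9, 0.4, 1), (0.8, 0.4, 2)))   # True
    print(dominates((0.9, 0.2, 1), (0.8, 0.4, 2)))   # False (trade-off)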


Maximization of the first two measures and minimization of the last measure are the objectives of the MOGA in both phases.

We have already mentioned that in the Michigan approach (one-rule-per-individual encoding) (Greene and Smith, 1993; Holland et al., 1989), a chromosome represents a CSR, i.e. an IF <Conditions> THEN <Class> rule. Although the individual encoding is simpler and syntactically shorter, the fitness of CSRs is not necessarily the best indicator of the quality of the discovered rule set. The Pittsburgh approach (several-rules-per-individual encoding) (Smith, 1980) has the advantage of judging rule interactions by considering its rule set (CR) as a whole. A hierarchical set of CSRs constitutes a CR:

IF <Conditions> THEN <Class>
ELSE IF <Conditions> THEN <Class>
ELSE IF <Conditions> THEN <Class>
...
ELSE <Default Class>
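The hierarchical structure above can be read as an ordered rule list. The sketch below (our illustration; the rule and record formats are assumptions, not the paper's data structures) shows how such a CR classifies a data point: the first matching CSR decides the class, and the default class is used otherwise.

    # A CR as an ordered list of CSRs plus a default class. Each CSR is a
    # (condition, class_label) pair, where condition is a predicate over a record.
    def classify(complete_rule, default_class, record):
        for condition, class_label in complete_rule:
            if condition(record):          # first matching CSR wins
                return class_label
        return default_class               # ELSE <Default Class>

    # Example with the student data set of Section 3.1.1 (body temperature, grade).
    cr = [
        (lambda r: 98.95 <= r["temp"] <= 100.0 and r["grade"] in ("C", "B"), "Absent"),
    ]
    print(classify(cr, "Present", {"temp": 99.4, "grade": "C"}))  # Absent
    print(classify(cr, "Present", {"temp": 98.7, "grade": "A"}))  # Present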


However, this approach makes the chromosome encoding more complicated and syntactically cumbersome, and usually needs more complex genetic operators. It also enlarges the search space and the time needed to find a good solution. To overcome the disadvantages and combine the advantages of both approaches, we developed a novel Bi-Phased Multi-Objective Genetic Algorithm (BPMOGA). B. de la Iglesia et al. (de la Iglesia et al., 2006) identified the need for a phase-II, but they left it for future research. Phase-I of BPMOGA extracts CSRs by following the Michigan approach, and phase-II builds CRs from CSRs. In phase-I, extraction of CSRs as per the Michigan approach (one-rule-per-individual encoding) requires only a simple encoding of chromosomes. Phase-II builds CRs from CSRs instead of building CRs from the training data set. The number of CSRs is normally less than the number of data points in the training data set, so the search space of phase-II becomes smaller for BPMOGA. CRs are encoded by simple binary coded chromosomes. Thus, BPMOGA tries to capture the advantages and minimize the drawbacks of both approaches.

The number of CSRs in CRs may increase if there is no control. This phenomenon is known as Bloat (Langdon, 1997), one of the disadvantages of the Pittsburgh approach. Sometimes the Pittsburgh approach produces CRs bigger


than necessary, contradicting Occam's razor principle: "The simplest explanation for some phenomenon is more likely to be accurate than more complicated explanations". At the end of phase-II, we perform Bloat control by reducing the number of CSRs in CRs. This reduces the length of CRs and generalizes them.

Unlike some LCSs, the BPMOGA based classifier avoids any user specified threshold values, to eliminate subjectivism and ensure objectivism. It may be noted that the runtime parameters required by BPMOGA are different from user-specified threshold values.

Some values of the data sets may be missing. Deleting data points having missing values, or substituting missing values with some values in the preprocessing phase, are common approaches in GA based classification rule mining, because those algorithms do not deal with missing feature values. The proposed BPMOGA based classifier can deal with missing feature values in data sets; deleting data points with missing feature values, or substituting them with some values, is not required for BPMOGA.

Some LCSs are suitable for either continuous or categorical types of features. When algorithms are suitable for categorical features, continuous features have to be converted into categorical features; similarly, when algorithms are suitable for continuous features, categorical features have to be converted into continuous features. Due to the hybrid chromosome encoding scheme of phase-I, BPMOGA can deal with both types of features without any conversion.

All features of a data set are not equally important for building the rules of classifiers. We propose an entropy based method in this paper, which can select features for rule formation probabilistically based on their importance.

So, the BPMOGA based classifier has several advantages. We use a global optimization algorithm with local search ability, so combining a separate local search algorithm with the GA is not required. In both phases of BPMOGA, we use MOGA, treating the problems as MOOPs instead of converting the MOOPs into single-objective optimization problems. Phase-I of BPMOGA extracts the optimized CSRs by following the Michigan approach, then phase-II builds the optimized CRs from CSRs by following the Pittsburgh approach. At the end of phase-II we perform Bloat control to reduce the complexity of CRs and to generalize them. To achieve better classification accuracy, the CSRs that are part of the optimized CRs should be readjusted. Thus, BPMOGA tries to capture the advantages and minimize the drawbacks of both approaches; a single-phased MOGA is not suitable for this. BPMOGA executes its two phases alternately, so that the output of one phase becomes the input of the other phase in a cyclic manner. Other advantages of the BPMOGA based classifier are: it can extract CSRs for different classes simultaneously, it does not need any user specified threshold values, it can work with missing feature values, and it can work with continuous


and categorical features; in BPMOGA, features form rules according to their importance in classification.

We organize the remaining part of the paper as follows. In Section 2, the work related to classification rule mining by GA is presented. Section 3 describes the approach towards solving the problem. Section 4 shows the testing and comparison of the proposed algorithm with others, using 21 benchmark data sets. Finally, Section 5 concludes the paper with ending remarks.

2. Related work


Chiu and Hsu propose a constraint-based GA approach to rule discovery, which applies a hybrid technique (local search with GA) (Chiu and Hsu, 2005). Researchers combine local search methods with GA in (Ishibuchi and Yamamoto, 2004; Zhong-Yang et al., 2004). The computation cost of such combined algorithms (local search with GA) is high. Early on, in "classic" GAs or binary coded GAs, binary strings represent individual solutions. Initial work on classification rule mining was done using binary coded single-objective GAs (De Jong et al., 1993; Janikow, 1993; Smith, 1980). For binary domains their performance is good, and in these cases one can use standard "off the shelf" GA operators with small changes (if any). In contrast to the binary coded GA, the real coded GA uses real values as parameters of the chromosome, so binary coding and decoding are not required. In a binary coded GA, to achieve high precision one has to use more bits to encode chromosomes, which increases cost and overhead. Another problem of the binary coded GA is that a long string occupies computer memory even though only a few bits are involved in the crossover and mutation operations. To overcome these inefficient codings and genetic operators, researchers use crossover and mutation operators on real coded chromosomes in (Corcoran and Sen, 1994). To deal with uncertainty, real coded GAs (Ishibuchi et al., 1994, 1995) extract fuzzy classification rules. Genetic tuning is used for parameter tuning of classifiers and to obtain an optimized fuzzy rule set in (Sanz et al., 2013, 2014, 2015).

The work of S. W. Wilson (Wilson, 1994) started the era of the Learning Classifier System (LCS). He proposed a classifier called the zeroth level classifier based on his own work (Wilson, 1986) and the work of J. H. Holland (Holland, 1976). The zeroth level classifier uses binary coded chromosomes and follows the Michigan approach. Then S. W. Wilson developed a new classifier, XCS, which tends to evolve classifiers that are maximally general, subject to an accuracy criterion (Wilson, 1995). E. B. Mansilla proposes a new LCS called UCS (sUpervised Classifier System) (Bernadó, 2002; Bernadó-Mansilla and Garrell-Guiu, 2003). XCS is based on a reinforcement learning scheme, whereas UCS is based on a supervised learning scheme. Fuzzy-UCS (Orriols-Puig et al., 2009) extracts fuzzy classification rules following the Michigan style LCS, whereas the zeroth level classifier, XCS and UCS extract crisp classification rules. The large number of extracted rules is one of the disadvantages of these classifiers. M. V. Fidelis et al. present a classification algorithm (Fidelis et al., 2000) based


on a single-objective GA that discovers comprehensible IF-THEN classification rules. A single-objective GA discovers only a few interesting, high-level rules (termed nuggets) without trying to discover complete classification knowledge (Freitas, 1999, 2013). K. K. Gündoğan et al. use binary and other types of encoded chromosomes, non-random Initial Population (IP) generation and a uniform operator to find similar types of rules using a single-objective GA (Gündoğan et al., 2004). E. G. Mansoori develops a steady-state single-objective GA for extracting fuzzy classification rules, named SGERD (Mansoori et al., 2008). S. Guan and F. Zhu apply a single-objective GA as the basic learning algorithm for incremental learning of classification rules (Guan and Zhu, 2005). In (Pal et al., 1998), a single-objective GA is used to find a set of optimized hyper-planes (represented by strings in chromosomes) in the feature space that gives minimum misclassification. In (Bandyopadhyay et al., 2000; Corcoran and Sen, 1994), researchers use the Pittsburgh approach. Recently, Hanley et al. propose a new evolutionary approach for discovering causal rules in complex classification problems from batch data (Hanley et al., 2019). Another recent work following the Michigan style is (Castellanos-Garzón et al., 2019), which uses genetic programming to extract classification rules from medical data.

However, after this, a third approach called iterative rule learning emerged (Venturini, 1993), where each run of the GA selects the best chromosome from the population. Individual chromosomes are encoded by following the Pittsburgh approach, although the final solution is a concatenation of rules, as in the Michigan approach, obtained by running the GA several times. An evolutionary algorithm, HIerarchical DEcision Rules (HIDER), is proposed in (Aguilar-Ruiz et al., 2003), where rules are obtained sequentially. HIDER with "natural coding" is proposed in (Aguilar-Ruiz et al., 2007). Bioinformatics-oriented Hierarchical Evolutionary Learning (BioHEL) applies an iterative rule learning approach explicitly designed to handle large-scale data sets (Bacardit and Garrell, 2007; Bacardit et al., 2007, 2009). For developing classifiers with continuous features, an effective and reduced set of cut points is required, and for that different discretization algorithms have been invented. One such algorithm, the Adaptive Discretization Intervals (ADI) rule representation using the Pittsburgh approach, is proposed in (Bacardit and Garrell, 2003) and further analyzed in (Bacardit and Garrell, 2004). Controlling the length of the chromosomes of an LCS is discussed in (Bacardit and Garrell, 2007). BioHEL is strongly influenced by GAssist (Bacardit and Garrell, 2004). L. Castillo et al. introduce a simplicity criterion in the selection of fuzzy rules in the genetic fuzzy learning algorithm called the structural learning algorithm in vague environment (Castillo et al., 2001). Humans can easily understand a simple description, hence it is preferable to a complex one. T. Weijters and J. Paredis propose (Weijters and Paredis, 2002) an intermediate strategy to find the rules: they combine a sequential covering strategy and the global search strategy of a GA to find the next promising rule. M. Setnes and H. Roubos develop a two-step process: first a fuzzy K-Means clustering algorithm is applied to get a compact initial rule base, and in the second step this is optimized by a real coded GA (Setnes and Roubos, 2000). Franco et al. compare GAssist


and BioHEL, which apply two different learning paradigms (the iterative rule learning approach and the Pittsburgh approach, respectively) (Franco et al., 2013). Minnaert et al. analyze several sequential rule induction algorithms to demonstrate the importance of heuristic rule evaluation functions (Minnaert et al., 2015).

At first, classification rule mining was done by single-objective GA. In (Fidelis et al., 2000; Gündoğan et al., 2004), researchers discuss works based on the Michigan approach, while in (Bandyopadhyay et al., 2000; Corcoran and Sen, 1994) a single-objective GA is used following the Pittsburgh approach. After the invention of MOGA, researchers started to do classification rule mining with it. To do classification rule mining by MOGA, H. Ishibuchi et al. use binary coded chromosomes (Ishibuchi and Yamamoto, 2004; Ishibuchi et al., 2008), whereas some researchers use real coded chromosomes (Dehuri and Mall, 2006; Dutta and Dutta, 2011). Z. Michalewicz and S. J. Hartley (Michalewicz and Hartley, 1996) have pointed out that for real valued numerical optimization problems, floating-point representations outperform binary representations because they are more consistent, more precise and lead to faster execution. S. Dehuri and R. Mall have done classification rule mining by simple GA, niched Pareto GA and improved niched Pareto GA (Dehuri and Mall, 2006); experimental results establish the superiority of the improved niched Pareto GA among the three algorithms. S. Dehuri et al. have also proposed an elitist MOGA for mining classification rules by optimizing predictive accuracy, comprehensibility and interestingness of the rules (Dehuri et al., 2008). B. de la Iglesia et al. have used different Multi-Objective metaheuristics, such as a crowding mechanism and the concept of rule similarity, to obtain interesting rule sets (de la Iglesia et al., 2005). They have done partial classification by a modified NSGA-II (de la Iglesia et al., 2006). Urbanowicz et al. have analyzed Michigan style LCSs with rule populations for global knowledge discovery (Urbanowicz et al., 2012). Nazmi et al. propose multi-label classification with weighted labels following the Michigan style (Nazmi et al., 2017).

When selected features take part in classification rule formation, some single-objective GAs use variable length chromosomes (Bandyopadhyay et al., 2000), where hyperplanes do not cover some of the feature space; this is a way of doing feature selection. To keep the length of chromosomes the same, padding is necessary for variable length chromosomes. In our earlier work (Dutta and Dutta, 2011), for any particular generation of the GA the length of chromosomes is fixed, but it varies from generation to generation. So special genetic operators are not required, because in any generation all chromosomes are of the same length; at different generations of the GA, different sets of features form chromosomes, so their lengths differ. In (Dehuri and Mall, 2006; Fidelis et al., 2000; Greene and Smith, 1993; Gündoğan et al., 2004; Mansoori et al., 2008; Noda et al., 1999), although each chromosome has a fixed length, the genes are "interpreted" in such a way that the individual phenotype (the rule) is of variable length.

As we combine the Pittsburgh and the Michigan approaches in BPMOGA, we should discuss some of the important works which follow the Pittsburgh


approach (Smith, 1980). S. Bandyopadhyay et al. propose a GA with variable length chromosomes and variable hyper-planes, which can encode a variable number of hyper-planes to overcome the over-fitting or under-fitting problem of their earlier work (Pal et al., 1998) with a pre-specified number of hyper-planes. They use class accuracy as one objective of the MOGA; here class accuracy is obtained by multiplying the individual class accuracies, by which they have converted multiple objectives into a single objective (Bandyopadhyay et al., 2004). In all of these works (Bandyopadhyay et al., 2000, 2004; Pal et al., 1998) hyper-planes enclose different regions as different class regions, which is equivalent to forming CRs. Using MOGA, M. Kaya describes (Kaya, 2010) a method of designing autonomous classifiers based on three objectives: understandability, classification accuracy, and average support value. Novel fuzzy rule based classifiers for financial data classification tasks are designed using the popular Multi-Objective evolutionary optimization algorithm NSGA-II (Gorzalczany and Rudziński, 2016). F. Rudziński presents a Multi-Objective genetic approach to design interpretability-oriented fuzzy rule-based classifiers (Rudziński, 2016). Researchers hybridize MOGA with tabu search (Ishibuchi and Yamamoto, 2004), association rule extraction (Ishibuchi et al., 2008) and the structural learning algorithm in vague environment (Narukawa et al., 2005), which are not global optimization techniques. Researchers apply fuzzy K-Means clustering to obtain a compact initial rule base and in a second step optimize this model by a real-coded GA (Setnes and Roubos, 2000). To construct a reliable and accurate fault diagnosis method, a multilayer perceptron neural network, a radial basis function neural network and K nearest neighbor are combined with GA to build classifiers (Lei et al., 2010). Wavelets are used to extract discriminative features as input for fuzzy rule generation by GA (Nguyen et al., 2015). A sparse fuzzy inference system termed GenSparseFIS is obtained by embedding genetic operators in a mimetic (sparse) fuzzy modeling approach (Serdio et al., 2017).

Two-phased Multi-Objective genetic local search algorithms for the selection of fuzzy IF-THEN rules are proposed in (Ishibuchi and Yamamoto, 2004; Ishibuchi et al., 2008). Researchers have combined two fuzzy genetics based machine learning algorithms into a single hybrid algorithm: they generate good fuzzy rules by following the Michigan approach and find good combinations of the generated fuzzy rules by following the Pittsburgh approach (Ishibuchi et al., 2005). They have also shown the interpretability-accuracy tradeoff in fuzzy rule-based classifiers using a multi-objective fuzzy genetics based machine learning algorithm (Ishibuchi and Nojima, 2007). Another two-phased Multi-Objective evolutionary algorithm for extraction of classification rules is proposed in (Chan et al., 2010). In these works (Chan et al., 2010; Ishibuchi and Yamamoto, 2004; Ishibuchi et al., 2005; Ishibuchi and Nojima, 2007; Ishibuchi et al., 2008), the output of the first phase is the input of the second phase, but the reverse is not true, i.e. these algorithms are not cyclic in nature, whereas BPMOGA is cyclic in nature. A co-evolutionary based classification technique, namely the CO-evolutionary Rule Extractor (CORE), is proposed to discover classification rules in (Tan et al., 2006). The proposed


CORE co-evolves rules and rule sets concurrently in two cooperative populations. Bacardit and J. M. Garrell have performed Bloat control in their work (Bacardit and Garrell, 2007), and H. Ishibuchi et al. have applied a similar method to reduce the size of CRs (Ishibuchi et al., 2008; Narukawa et al., 2005). In (Ishibuchi and Yamamoto, 2004; Ishibuchi et al., 2008; Narukawa et al., 2005; Zhong-Yang et al., 2004), users have to specify various measures, like minimum support, confidence and number of antecedent conditions, to remove CSRs having values lower than the specified limits. H. Ishibuchi et al. use a threshold value to limit the number of features used in rule formation (Ishibuchi and Yamamoto, 2004; Narukawa et al., 2005). The user specified threshold values are all decided based on heuristics.

Classification algorithms based on single-objective GA (Dehuri and Mall, 2006; de la Iglesia et al., 2005; Freitas, 2013; Greene and Smith, 1993; Mansoori et al., 2008; Noda et al., 1999; Wilson, 1994) and some MOGA based classification algorithms discussed in (Bacardit and Garrell, 2007; Bacardit et al., 2009; Bandyopadhyay et al., 2000, 2004; Castillo et al., 2001; Corcoran and Sen, 1994; Dehuri and Mall, 2006; de la Iglesia et al., 2006; Fidelis et al., 2000; Gündoğan et al., 2004; Ishibuchi and Yamamoto, 2004; Ishibuchi et al., 2008; Kaya, 2010; Narukawa et al., 2005; Zhong-Yang et al., 2004) do not deal with missing feature values in the data sets. Researchers have deleted the data points with missing feature values (de la Iglesia et al., 2006; Fidelis et al., 2000; Gündoğan et al., 2004) or have substituted missing values before running the classification algorithms (Bacardit and Garrell, 2007; Orriols-Puig et al., 2009). In our earlier work (Dutta and Dutta, 2011), we have shown how to deal with missing feature values in a MOGA based classification rule extraction method. Some previous algorithms (Bernadó-Mansilla and Garrell-Guiu, 2003; Chan et al., 2010; Chiu and Hsu, 2005; Guan and Zhu, 2005; Orriols-Puig et al., 2009; Tan et al., 2006) can deal with missing feature values in data sets. Many earlier works are suitable for either continuous or categorical types of features (Bandyopadhyay et al., 2000, 2004; Corcoran and Sen, 1994; Dehuri and Mall, 2006; Ishibuchi and Yamamoto, 2004; Ishibuchi et al., 2008; Kaya, 2010; Mansoori et al., 2008; Narukawa et al., 2005; Noda et al., 1999; Pal et al., 1998; Zhong-Yang et al., 2004). Some earlier works (Bernadó-Mansilla and Garrell-Guiu, 2003; de la Iglesia et al., 2006; Fidelis et al., 2000; Greene and Smith, 1993; Gündoğan et al., 2004; Orriols-Puig et al., 2009) are capable of working with both types of features.

Most LCSs give equal importance to all features during rule formation, but all features are not equally good for rule formation. The identification of significant features is of major importance to the performance of a variety of LCSs. In (Ishibuchi and Yamamoto, 2004; Narukawa et al., 2005), the user chooses features for rule formation. In (Xia et al., 2013), researchers introduced a univariate estimation of distribution algorithm technique into BioHEL (Bacardit and Garrell, 2007; Bacardit et al., 2007, 2009) to improve its performance. A. A. Freitas has discussed the use of evolutionary algorithms in data mining and knowledge discovery, focusing on classification, in (Freitas, 2003). Interested


readers may see review papers on LCSs (Sigaud and Wilson, 2007; Urbanowicz and Moore, 2009) for further reading. In (Fernández et al., 2010), we can find a comparative study of genetics based machine learning methods against some classical non-evolutionary algorithms. Mukhopadhyay et al. have done a survey on Multi-Objective evolutionary algorithms in (Mukhopadhyay et al., 2014a,b). In (Fernández-Delgado et al., 2014), the authors evaluated 179 classifiers arising from 17 families using data sets from the UCI machine learning repository (Asuncion and Newman, 2007). A recent survey of evolutionary algorithms using metameric representations can be found in (Ryerkerk et al., 2019).

3. Bi-Phased Multi-Objective Genetic Algorithm (BPMOGA)

BPMOGA executes in two phases: phase-I and phase-II. Phase-I, phase-II and their relationship are described below.

3.1. Phase-I

Phase-I of BPMOGA consists of the following steps:


1. Population initialization
2. Crossover
3. Mutation
4. Combination
5. Chromosome filtering
6. Computation of objective values
7. Selection of Pareto chromosomes
8. Generation of intermediate population

We present each step below, and the flow diagram of phase-I is shown in Figure 1.
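A compact sketch of one run of phase-I, following the step order above, is given below (hedged pseudocode in Python; every helper function is a placeholder for the operation detailed in Sections 3.1.1 to 3.1.8 and is named by us, not by the paper).

    # Sketch of one run of phase-I (50 generations, per Section 3.1.8).
    # All helpers are placeholders for the steps described in 3.1.1-3.1.8.
    def phase_one(training_data, generations=50):
        population = initialize_population(training_data)             # 3.1.1
        pareto = []
        for g in range(generations):
            children = crossover(population) + mutate(population)     # 3.1.2, 3.1.3
            combined = pareto + children                              # 3.1.4 combination
            combined = filter_chromosomes(combined)                   # 3.1.5 filtering
            evaluate_objectives(combined, training_data)              # 3.1.6 (Cnf, Cov, NVF)
            pareto = select_pareto_per_class(combined)                # 3.1.7 elitist selection
            population = intermediate_population(pareto, training_data, g)  # 3.1.8
        return pareto   # CSRs passed on to phase-II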


3.1.1. Population initialization

The Initial Population (IP) of phase-I is created by generating a set of chromosomes, where every chromosome encodes the antecedent part of an IF <Antecedent> THEN <Consequence> CSR. This step consists of (a) chromosome encoding and (b) feature selection methods.

(a) Chromosome encoding: The number of genes in each chromosome is equal to the number of features in the training data set. Here, we work with data sets comprising continuous and categorical features. Real coding and binary coding are used to encode the genes corresponding to continuous and categorical features, respectively. Any continuous feature is represented by a range between a minimum and a maximum value in any CSR. Let the ith continuous feature have minimum value ARri min and maximum value ARri max in the rth chromosome; the corresponding gene is then [ARri min, ARri max]. Categorical features are encoded by a stream of binary

Table 1: Student data set

Body temperature   Grade   Attendance
99.4               C       Absent
99.5               B       Absent
98.7               A       Present
100.0              C       Absent
99.0               C       Absent
99.2               B       Absent
98.9               B       Present
98.5               A       Present
98.9               A       Present
99.8               A       Present

bits, where the length of the stream, i.e. the number of bits, is equal to the number of distinct values of the categorical feature in the training data set. In the stream, bit value '1' at any position indicates that the corresponding categorical feature is active with the corresponding value, while '0' indicates that the feature is inactive with the corresponding value. When a feature is not participating in the formation of a CSR, BPMOGA encodes the corresponding gene with the symbol # ("do not care" condition). The logical expression of an antecedent is written below.

(... ∧ (Vix j ∈ ARr ix 1 ∨ ARr ix 2 ∨ ... ∨ ARr ix k) ∧ ... ∧ (Viy j ≥ ARr iy min ∧ Viy j ≤ ARr iy max) ∧ ... ∧ (Viz j = #) ∧ ...)

where the ixth feature is categorical, the iyth feature is continuous and the izth feature is not used in rule formation. By this type of hybrid chromosome encoding we can work with continuous and categorical features together. To explain the process we use a synthetic student data set with three features, shown in Table 1. The three features are: body temperature in degrees Fahrenheit (continuous feature), grade of students (categorical feature) and attendance (class label). In the preprocessing phase, the values of all continuous features are sorted (for example, Table 2 is obtained from Table 1 after sorting). Here, we define the notion of a 'change point' to specify where the data points change their class label with respect to each continuous feature. The lowest and highest values of each continuous feature are taken as the first and last change points. So, the first change point for body temperature is 98.5 and the last change point is 100.0. The other change points are the midpoints of the corresponding values where the class label changes. For example, at body temperature 98.9 the class label is 'Present' and at body temperature 99.0 the class label is 'Absent', so the change point is 98.95, the midpoint of 98.9 and 99.0. For body temperature the change points are 98.5, 98.95, 99.65, 99.9 and 100.0. Now, a CSR may be as follows: IF 98.95 ≤ Body temperature ≤ 100.0 and Grade is C or B THEN Absent.
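The change-point computation for one continuous feature can be sketched as follows (our illustration; function names are ours). Running it on the body-temperature column of Table 1 yields the change points 98.5, 98.95, 99.65, 99.9 and 100.0 quoted above.

    # Change points of a continuous feature: the smallest and largest values,
    # plus the midpoints between consecutive sorted values where the class changes.
    def change_points(values, labels):
        pairs = sorted(zip(values, labels))
        points = [pairs[0][0]]
        for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
            if c1 != c2:
                points.append((v1 + v2) / 2.0)
        points.append(pairs[-1][0])
        return sorted(set(points))

    temps  = [99.4, 99.5, 98.7, 100.0, 99.0, 99.2, 98.9, 98.5, 98.9, 99.8]
    labels = ["Absent", "Absent", "Present", "Absent", "Absent",
              "Absent", "Present", "Present", "Present", "Present"]
    print(change_points(temps, labels))   # [98.5, 98.95, 99.65, 99.9, 100.0]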

Table 2: Modified student data set

Sorted body temperature   Grade   Attendance
98.5                      A       Present
98.7                      A       Present
98.9                      B       Present
98.9                      A       Present
99.0                      C       Absent
99.2                      B       Absent
99.4                      C       Absent
99.5                      B       Absent
99.8                      A       Present
100.0                     C       Absent
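For illustration only, the hybrid encoding of the example CSR can be written as a pair of genes, one real-coded range and one bit string (a sketch under our own naming; the actual chromosome layout is the one shown in Figure 2).

    # Hybrid chromosome for the example CSR above:
    # gene 1: [ARr1_min, ARr1_max] for the continuous feature 'body temperature',
    # gene 2: bit string over the categories (C, B, A) of 'grade'; '#' would mean do not care.
    chromosome = {
        "body_temperature": (98.95, 100.0),   # range gene built from change points
        "grade": "110",                       # C and B active, A inactive
    }

    def covers(chrom, record, categories=("C", "B", "A")):
        lo, hi = chrom["body_temperature"]
        if not (lo <= record["temp"] <= hi):
            return False
        return chrom["grade"][categories.index(record["grade"])] == "1"

    print(covers(chromosome, {"temp": 99.4, "grade": "C"}))   # True
    print(covers(chromosome, {"temp": 98.7, "grade": "A"}))   # False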

The corresponding chromosome is shown in Figure 2. Here A1 min = 98.95 and A1 max = 100.0 for the continuous feature representing body temperature. For the gene of the categorical feature, '1' in the 1st and 2nd positions of the binary string indicates that the categories 'C' and 'B' of student grade are active, while '0' in the 3rd position denotes that the category 'A' is inactive. The position of categorical feature values in the respective gene depends on their order of appearance in the training data set (as per Table 1); so, in this example, the student grades at the 1st, 2nd and 3rd positions are 'C', 'B' and 'A', respectively. This hybrid chromosome encoding scheme is based on HIDER (Aguilar-Ruiz et al., 2003), but in HIDER, ARri min or ARri max can take any value of the ith feature of the training data set, whereas in our proposed method ARri min or ARri max can take values only from the change points of the ith feature. In the preprocessing phase, we find the change points. As we are looking for the boundaries of classes, we limit our search to change points only. Noise in the data could affect the change points and may generate extra change points; after some generations of BPMOGA, the rules formed by those change points get eliminated, as they do not obtain sufficient support or coverage (discussed later). We do not apply any discretization algorithm to the continuous feature values in the data sets before applying BPMOGA. Rather, continuous features are discretized by change points, and BPMOGA uses change points only to specify class boundaries for continuous features. This limits the number of unique CSRs and helps in Bloat control.

BPMOGA randomly selects two data points with the same class label from the training data set for encoding a chromosome. The chosen data points may have missing feature values. For each continuous feature, there are three possible cases: (a) only one data point has a valid feature value, (b) both data points have valid feature values, and (c) neither data point has a valid feature value. For case (a), for the ith feature, a change point is chosen as either ARri min or ARri max and the remaining one takes # as its value, so that the gene of the ith feature covers the chosen data point with a valid feature value. For case (b), two change points are taken as ARri min


and ARri max, so that the gene covers both chosen data points. For case (c), both ARri min and ARri max take #. If the ith feature value of a data point falls within the range [ARri min, ARri max] of the gene, then it is covered by that gene. A gene value of [#, ARri max] denotes the range [−∞, ARri max], [ARri min, #] denotes the range [ARri min, +∞], and # denotes the "do not care" condition. To make the chromosomes more general and simpler, 50% of the ARri min or ARri max values produced in case (b) are changed to # randomly. This step is used to avoid over-fitting and to improve the generalization ability of the CSRs. For each categorical feature, the number of bits in the gene is equal to the number of categories present in the training data set. The two data points chosen for building a chromosome may have different categorical feature values, the same value, or a missing value (denoted by '?'). If values are missing in the chosen data points, the corresponding genes are encoded by #. In case of one or two valid categorical feature values, the corresponding bit(s) of the gene(s) is/are assigned '1' and the other bit(s) is/are assigned '1' or '0' randomly. This process ensures that the chosen data points will be covered by the chromosome. To build a chromosome, we randomly choose the 3rd and 7th data points from Table 1, for which the change points are 98.5 and 98.95 for i = 1. For i = 2, the gene is encoded as '011', i.e. '0' for C, '1' for B and '1' for A. The corresponding chromosome is shown in Figure 3.

(b) Feature selection: All features of a data set may not be equally important for classification. In this work, the important features are selected probabilistically and are subsequently used to build the CSRs. We propose here an entropy based feature selection method where chromosomes are encoded using a variable number of features; better features should get a higher chance of taking part in rule formation. Chromosomes formed by the earlier step are the inputs here. The number of features taking part in rule formation may vary from 1 to (n − 1), where (n − 1) is the number of features in the data set excluding the class label. Without user intervention, BPMOGA selects features probabilistically for building CSRs. The process is described below.

(i) Calculate the entropy of a class (E(C)) by equation (1):

E(C) = -\sum_{k=1}^{c} p_k \log_2(p_k)    (1)

where c is the number of classes and p_k is the probability of a data point belonging to the kth class. If |T| is the number of data points in the training data set and |T_k| is the number of data points in the training data set belonging to the kth class, then p_k = |T_k|/|T|.

(ii) Calculate the average (weighted) entropy (AE(A)) of feature A using equation (2):

AE(A) = -\sum_{V \in Values(A)} \frac{|T_V|}{|T|} \, E(A_V)    (2)


where Values(A) represents the set of all possible values of feature A, |T_V| represents the number of instances in the training data set with value V of feature A, |T| is the number of data points in the training data set, and E(A_V) is the entropy of feature A with value V. We use change points to discretize the continuous features before calculating entropy. Here E(A_V) = \sum_{k=1}^{K} E(A_V^k), where E(A_V^k) is the entropy of feature (attribute) A with value V and class k.

(iii) Calculate the information gain (IG(A)) of all features using equation (3):

IG(A) = E(C) - AE(A)    (3)

(iv) The probability of the ith feature (P_{A_i}) taking part in CSR formation is

P_{A_i} = IG(A_i) \Big/ \sum_{i=1}^{n-1} IG(A_i)    (4)

where (n − 1) is the number of features.
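A minimal sketch of steps (i) to (iv) and the subsequent probabilistic selection is given below (our illustration, with our own function names; it uses the standard entropy and information-gain forms that equations (1) to (4) describe and assumes the continuous features have already been discretized by their change points).

    import math, random

    def entropy(labels):
        n = len(labels)
        return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                    for c in set(labels))                        # equation (1)

    def information_gain(feature_values, labels):
        e_class = entropy(labels)
        n = len(labels)
        avg = 0.0
        for v in set(feature_values):
            subset = [c for f, c in zip(feature_values, labels) if f == v]
            avg += (len(subset) / n) * entropy(subset)           # weighted entropy, equation (2)
        return e_class - avg                                     # equation (3)

    def selection_probabilities(columns, labels):
        gains = [information_gain(col, labels) for col in columns]
        total = sum(gains)
        return [g / total for g in gains]                        # equation (4)

    def pick_features(probabilities):
        # Feature i joins the chromosome if a uniform random draw falls below P_Ai.
        return [i for i, p in enumerate(probabilities) if random.random() < p]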


The proposed probabilistic feature selection method ensures that the CSRs formed using the selected features are more effective than CSRs built from all features. BPMOGA selects features without user intervention and without any user specified threshold values; it also reduces the system complexity. The feature selection method generates a random value between 0 and 1, and if this value is less than P_{A_i} then the ith feature is selected for chromosome formation, otherwise it is not. The chromosome shown in Figure 3 may thus be modified to the chromosomes shown in Figure 4 and Figure 5.

The performance of the GA depends on the size of the Initial Population (IP). A large IP helps to search the large space associated with large data sets. In the present context, the IP size of phase-I is 20 when the training data set has at most 200 data points; otherwise, the IP size is 10% of the size of the training data set. To build the IP, BPMOGA creates the required number of chromosomes by the chromosome encoding and feature selection methods.

3.1.2. Crossover

For the first generation of the first run of phase-I, the IP is the input of crossover and mutation (described next). From the second generation, intermediate populations (created by the method described in Section 3.1.8) are the inputs of crossover and mutation. Here, the single-point crossover operation acts on a pair of chromosomes chosen randomly from the parent population and creates (possibly) two new chromosomes for the next generation. Each chromosome in the parent population takes part in crossover only once. If the parent population has an even number of chromosomes, all chromosomes take part in the crossover; when the parent population has an odd number of chromosomes, one residual chromosome is copied into the child population and the others take part in crossover. Therefore, the crossover probability here is close to one. The crossover point for each operation is chosen randomly from the (n − 1) genes.


Among the (n − 1) genes, some are for continuous features and some are for categorical features. If a gene for a continuous feature is randomly chosen for crossover, then BPMOGA chooses the middle position of [ARri min, ARri max] as the crossover point. After crossover, if ARri min is greater than ARri max then they are interchanged. For genes of categorical features, BPMOGA chooses any position of the binary string randomly as the crossover point. If the gene chosen for crossover contains #, then crossover is performed at the boundary of the gene (Figures 4 and 5). Figures 6, 7 and 8 show different outputs of the crossover operation.

3.1.3. Mutation

Here, the mutation probability is 0.1, which implies that 10% of the genes of the parent population are mutated. At most one gene of any chromosome is mutated. Some chromosomes participate in the mutation operation to build new chromosomes for the child population, and the rest are copied into the child population. Like crossover, there are various types of mutation. A continuous feature (say, i) takes part in rule formation through the gene values ARri min and ARri max. For mutation, either ARri min or ARri max is chosen randomly. Between ARri min and ARri max, one may hold a valid value and the other may hold #; in that case, the one that holds a valid value is chosen for mutation. Mutation changes the ARri min or ARri max value to the next lower or higher change point; in this way BPMOGA performs a local search. For genes of categorical features, bit values are changed randomly by mutation. If we wanted to replace #-valued genes, we would have to put two values for a continuous feature (ARri min and ARri max) and assign several bits randomly for a categorical feature; in that case the probability is high that the child chromosome would not cover even a single data point of the training data set. So, mutation is not applied to genes holding the value #. We illustrate the mutation operation in Figures 9 and 10.

3.1.4. Combination

Chromosomes obtained by the crossover and mutation operations and by the previous generations of phase-I (described in Section 3.1.7) are combined together. This step is required in order to maintain elitism during selection.
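As an illustration of the restricted local search performed by the mutation operator of Section 3.1.3, the sketch below (our own naming and simplification; the real operator works on the encoded genes described in Section 3.1.1) moves one bound of a continuous gene to an adjacent change point.

    import random

    # Mutation on a continuous gene [ARri_min, ARri_max]: one bound (never a '#')
    # is moved to the next lower or higher change point, which is the restricted
    # local search described in Section 3.1.3.
    def mutate_bound(value, change_points):
        idx = change_points.index(value)
        idx = min(max(idx + random.choice([-1, 1]), 0), len(change_points) - 1)
        return change_points[idx]

    points = [98.5, 98.95, 99.65, 99.9, 100.0]
    print(mutate_bound(98.95, points))   # 98.5 or 99.65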


3.1.5. Chromosome filtering

The proposed chromosome filtering technique consists of the following steps: (a) eradication of useless genes, (b) removing invalid chromosomes and (c) removing duplicate chromosomes.

(a) Eradication of useless genes: The previous methods may create some genes with meaningless propositions. Genes of a continuous feature may have the values ARri min = min(Ai) and ARri max = max(Ai). Here, min(Ai) and max(Ai) represent the smallest and largest values of the ith feature of the training data set. In those cases, such genes cover all data points of the training data set and are therefore meaningless. BPMOGA modifies these useless genes as described below.


• If ARri min = min(Ai) and ARri max ≠ max(Ai) and ARri max ≠ #, then the ith gene is modified to [#, ARri max]
• If ARri min ≠ min(Ai) and ARri min ≠ # and ARri max = max(Ai), then the ith gene is modified to [ARri min, #]
• If ARri min = min(Ai) and ARri max = max(Ai), then the ith gene is modified to [#]
• If ARri min = min(Ai) and ARri max = #, then the ith gene is modified to [#]
• If ARri min = # and ARri max = max(Ai), then the ith gene is modified to [#]

The following example explains the usefulness of the process. The smallest value of 'body temperature' in the student data set shown in Tables 1 and 2 is 98.5. For any chromosome, if we get a gene value [98.5, 99.65] for 'body temperature' and [#] for 'Grade', all data points having a 'body temperature' value in the range [98.5, 99.65] will be covered by the rule formed by that chromosome. BPMOGA changes [98.5, 99.65] to [#, 99.65]. The training data set does not have any value less than 98.5, but the testing data set may have values lower than 98.5. If the range remains [98.5, 99.65], testing data points with values less than 98.5 will not be covered by the rule; if the range becomes [−∞, 99.65], they will be covered. For any categorical feature, if all bits of the corresponding gene are '0', the gene does not cover any data point of the training data set; however, if all bits are '1' then it covers all data points. Therefore, BPMOGA replaces such genes with #. The chromosome shown in Figure 4 is modified and shown in Figure 11: as 98.5 is the lowest temperature in the student data set, it is replaced by #.

(b) Removing invalid chromosomes: If all genes in a chromosome have # values, the chromosome does not specify any condition; it is an invalid chromosome, and such chromosomes are removed from the population.

(c) Removing duplicate chromosomes: To reduce the population size, only unique chromosomes are retained by removing the duplicate chromosomes. This step makes the population more effective; it increases the chance of convergence and reduces the population size.

3.1.6. Computation of objective values

Researchers measure the merits of CSRs by evaluating gain, variance, chi-squared value, entropy gain, gini, laplace, lift, and conviction, as described in (Bayardo Jr and Agrawal, 1999). According to that study, Pareto-optimal CSRs lying on the front of maximum confidence (accuracy) and maximum coverage (support) are the best rules. So, in phase-I, maximizing the confidence


and coverage of CSRs are the twin objectives of M OGA. In addition, to make the CSRs simpler, another objective of phase-I is to minimize the Number of Valid Features (N V F ) in the CSRs. Chromosomes encode the antecedents of the CSRs and, depending on that information, BP M OGA finds out the consequences of the CSRs. It checks whether the feature values of a training data point satisfy all conditions of a chromosome (excluding features with missing values and genes with "do not care"). If the point is covered, the support of the antecedent (SU P (Ant)) is incremented by 1. If the data point belongs to the kth class, the kth counter value is incremented by 1. After checking all data points of the data set, the highest counter value is assigned as the support of the consequence (SU P (Cons)) and the corresponding class label becomes the consequence of the chromosome. Following the same procedure, BP M OGA calculates the support of the antecedent and consequent together (SU P (Ant ∧ Cons)). Finally, it calculates Confidence (Cnf ) and Coverage (Cov) by using equations (5) and (6) respectively.

Confidence(Cnf ) = SU P (Ant ∧ Cons) / SU P (Ant)    (5)

Coverage(Cov) = SU P (Ant ∧ Cons) / SU P (Cons)    (6)
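A minimal Java sketch of this computation is given below (the method and parameter names are illustrative assumptions, not part of the published implementation; the consequence is assumed to have already been fixed as the majority class among the covered points, as described above):

    import java.util.List;

    /** Computes Cnf and Cov of one class specific rule (CSR) following equations (5) and (6). */
    public class CsrEvaluator {
        /** antecedentCovers.get(i) is true when data point i satisfies every condition of the rule;
         *  classLabels.get(i) is the actual class of point i; consequence is the class predicted by the rule. */
        public static double[] confidenceAndCoverage(List<Boolean> antecedentCovers,
                                                     List<String> classLabels,
                                                     String consequence) {
            int supAnt = 0, supCons = 0, supAntAndCons = 0;
            for (int i = 0; i < classLabels.size(); i++) {
                boolean sameClass = classLabels.get(i).equals(consequence);
                if (sameClass) supCons++;                       // SUP(Cons)
                if (antecedentCovers.get(i)) {
                    supAnt++;                                   // SUP(Ant)
                    if (sameClass) supAntAndCons++;             // SUP(Ant AND Cons)
                }
            }
            double cnf = supAnt == 0 ? 0.0 : (double) supAntAndCons / supAnt;    // equation (5)
            double cov = supCons == 0 ? 0.0 : (double) supAntAndCons / supCons;  // equation (6)
            return new double[] {cnf, cov};
        }
    }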

N V F is equal to the number of valid features in a chromosome. Here two features, 'Body Temperature' and 'Grade', have valid values, i.e. they do not have the 'don't care' condition '#', so N V F = 2. The chromosome shown in Figure 2 builds the following CSR:

IF 98.95 ≤ Body temperature ≤ 100.0 and Grade is C or B THEN absent, with Cnf = 1.0, Cov = 1 and N V F = 2.


3.1.7. Selection of Pareto chromosomes

BP M OGA selects the CSRs of every class lying on the Pareto front of max(Cnf, Cov, 1/N V F ) as Pareto chromosomes. To convert the optimization problem into a maximization problem, we consider the maximization of 1/N V F as the third objective of phase-I. The proposed method compares and selects the chromosomes of each class separately, so the M OGA of phase-I guarantees the selection of at least one CSR for each class of the training data set. Selection is applied on the combined rule set, so BP M OGA adheres to elitism. Chromosomes which are not dominated by any other chromosome of the population are known as elite chromosomes. Chromosome A dominates chromosome B if and only if: (1) A is strictly better than B in at least one of the measures considered and (2) A is not worse than B in any of the measures considered. We do not apply the crowding distance; instead we select all rules lying on the Pareto-optimal front. Therefore, the size of the population varies from generation to generation.
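The dominance relation and the per-class Pareto selection can be sketched as follows (an illustrative Java fragment, not the authors' implementation; the Csr class is a hypothetical container for the three objective values):

    import java.util.ArrayList;
    import java.util.List;

    /** Selects the non-dominated CSRs of one class on the objectives (Cnf, Cov, 1/NVF). */
    public class ParetoSelector {
        public static class Csr {
            final double cnf, cov; final int nvf;
            public Csr(double cnf, double cov, int nvf) { this.cnf = cnf; this.cov = cov; this.nvf = nvf; }
            double[] objectives() { return new double[] {cnf, cov, 1.0 / nvf}; }  // all three maximised
        }

        /** a dominates b iff a is not worse in any objective and strictly better in at least one. */
        static boolean dominates(Csr a, Csr b) {
            double[] oa = a.objectives(), ob = b.objectives();
            boolean strictlyBetter = false;
            for (int k = 0; k < oa.length; k++) {
                if (oa[k] < ob[k]) return false;
                if (oa[k] > ob[k]) strictlyBetter = true;
            }
            return strictlyBetter;
        }

        /** Keeps every chromosome of one class that no other chromosome of the same class dominates. */
        public static List<Csr> paretoFront(List<Csr> sameClassCsrs) {
            List<Csr> front = new ArrayList<>();
            for (Csr candidate : sameClassCsrs) {
                boolean dominated = false;
                for (Csr other : sameClassCsrs) {
                    if (other != candidate && dominates(other, candidate)) { dominated = true; break; }
                }
                if (!dominated) front.add(candidate);   // elite chromosomes survive
            }
            return front;
        }
    }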


Some earlier works (de la Iglesia et al., 2006) used the Non-dominated Sorting GA-II (Deb et al., 2002), where researchers applied the crowding distance and selected rules based on rank to keep the population size fixed. The chromosomes of the selected CSRs are combined with the other chromosomes of the next generation of phase-I (described in section 3.1.4). With the generations of phase-I the size of the population increases, but as we run 50 generations in one run, the size of the population does not become unmanageable. The chromosomes, or CSRs, selected by the 50th generation of phase-I are the input for phase-II of BP M OGA. In phase-II, some CSRs are selected and some are rejected based on their goodness in terms of their interactions in CRs. Thus we reduce the number of CSRs considering the interactions among them in CRs, rather than selecting them on the basis of crowding distance.

3.1.8. Generation of intermediate population

From the second generation, for every odd generation, the populations are created from the selected Pareto chromosomes of previous generations of phase-I. This helps to build a population of chromosomes close to the Pareto-optimal front, resulting in fast convergence of the phase-I M OGA. To introduce diversity in the population, every even generation of phase-I introduces new chromosomes by creating populations with the initial population creation method. We do not lose any Pareto chromosome found in an earlier generation, as we select chromosomes from a combined population of the Pareto chromosomes of previous generations and the chromosomes of the present generation. BP M OGA runs 50 iterations of phase-I, which produce CSRs as input to phase-II of BP M OGA.

3.2. Phase-II

The task of phase-II is to build optimized CRs from the CSRs produced by phase-I. Phase-II optimizes CRs by satisfying three objectives simultaneously: (a) maximizing Total Confidence (T Cnf ), (b) maximizing Total Coverage (T Cov) and (c) minimizing the Number of CSRs (N CSR) in CRs. A good CR must be as accurate (T Cnf ) as possible, i.e. its misclassification rate must be as low as possible; it must have generalization ability, so that it covers (T Cov) as many data points of the training data set as possible; and it must be simple, i.e. it must contain the minimum Number of CSRs (N CSR). Phase-II creates CRs from CSRs and selects Pareto-optimal CRs using M OGA. Phase-II of BP M OGA consists of the following sequential steps, as depicted in Figure 12.


1. Initialization of population
2. Crossover
3. Mutation
4. Combination
5. Bloat control
6. Chromosome filtering
7. Evaluation of objective values
8. Selection of Pareto rules
9. Generation of intermediate population

3.2.1. Initialization of population

Before building the initial population for phase-II, as a preprocessing step, the Pareto-optimal CSRs generated in phase-I are sorted in descending order of their confidence values, giving the highest importance to the accuracy of the CSRs. CSRs with equal confidence are sorted in descending order of their coverage values. In case of the same confidence and coverage, CSRs are sorted in ascending order of their N V F values. If all are equal, they are kept in random order. Sorting is required because CRs must first be accurate, then general and lastly simple. The IP of phase-II comprises 20 binary chromosomes, where the length of each chromosome is equal to the number of CSRs extracted by phase-I. Using the sorted CSRs, BP M OGA encodes the chromosomes to build the CRs. The bit value '1' at the rth gene indicates that the rth CSR of the sorted CSRs is part of the corresponding CR, while the bit value '0' at the rth gene indicates that the rth CSR is not part of that CR. Suppose a population extracted by phase-I consists of four CSRs. If a CR is represented by the binary string '1001', then the 1st CSR and the 4th CSR form the CR, whereas the 2nd and 3rd CSRs do not take part in the CR formation. Here, the probability of selecting any CSR for CR formation is denoted as the Class Rule Probability (CRP ). For this work, the CRP value is 0.5. For every gene of a chromosome of a CR, BP M OGA generates a number randomly between 0 and 1. If it is less than the CRP , i.e. 0.5, the corresponding bit is set to '1', otherwise to '0'. Experimentation can be done to decide the optimum value of the CRP . It is to be noted that for a CR to be valid and operative, it must have at least one CSR from each class.

3.2.2. Crossover

All chromosomes (100%) take part in crossover when the number of chromosomes in the population is even. When the number of chromosomes in the population is odd, all chromosomes except the last one take part in crossover and the last chromosome is copied into the child population. Two chromosomes from the population take part in one single-point crossover and produce two child chromosomes. Crossover points are chosen randomly and a chromosome takes part in crossover only once. Chromosomes formed by the initial population creation procedure are the parent chromosomes for the first generation of phase-II. From the second generation, chromosomes created by the intermediate population creation method of phase-II are the parent chromosomes for the crossover operation.
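A minimal Java sketch of the chromosome initialization of section 3.2.1, with CRP = 0.5 and a decoding helper corresponding to the '1001' example, is given below (class and method names are illustrative assumptions):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /** Builds one phase-II chromosome (a CR) over the sorted CSRs using the class rule probability CRP = 0.5. */
    public class CrInitializer {
        private static final double CRP = 0.5;
        private static final Random RNG = new Random();

        /** One bit per sorted CSR: '1' means the CSR takes part in the CR, '0' means it does not. */
        public static boolean[] randomCr(int numberOfSortedCsrs) {
            boolean[] bits = new boolean[numberOfSortedCsrs];
            for (int r = 0; r < bits.length; r++) {
                bits[r] = RNG.nextDouble() < CRP;   // e.g. '1001' selects the 1st and 4th sorted CSRs
            }
            return bits;
        }

        /** Decodes a chromosome back into the list of participating CSR indices (0-based). */
        public static List<Integer> decode(boolean[] bits) {
            List<Integer> selected = new ArrayList<>();
            for (int r = 0; r < bits.length; r++) if (bits[r]) selected.add(r);
            return selected;
        }
    }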


3.2.3. Mutation

As all chromosomes are binary, simple binary mutations are performed by flipping bits. The mutation probability for phase-II is set to 0.1. A random number between 0 and 1 is generated for each gene of a chromosome; if it is less than the mutation probability the gene is mutated, otherwise it is not. The parent population for mutation is the same as the parent population for crossover.

3.2.4. Combination

This step combines the population from the previous generation of phase-II (selected by the Pareto rules selection method described in section 3.2.8), the post-crossover population and the post-mutation population. This step is required to ensure elitism preservation during selection.

3.2.5. Bloat control

All the sorted CSRs in the CRs are checked to see whether they can cover the data points of the training data set. If a data point satisfies all conditions of a CSR, it is covered by that CSR and is not considered further for the remaining CSRs of that CR. If a CSR of a CR does not cover any data point, because the previous CSRs have already covered all the data points, that CSR is eliminated from the CR. This is done by changing the corresponding bit value from '1' to '0'. In this way, we reduce the number of CSRs in CRs and simplify the CRs. This phenomenon is known as Bloat control (Langdon, 1997).

3.2.6. Chromosome filtering

Duplicate chromosomes, if obtained after the genetic operations, are discarded. As a result, all the chromosomes become unique and the population size is reduced.

3.2.7. Evaluation of objective values

We derive the values of T Cnf , T Cov and N CSR with the help of two parameters, N OC (Number of Cover) and N OM (Number of Match). N OC and N OM are initialized to '0'. N OC is incremented by one when a data point satisfies all conditions of any CSR in the CR. When the class label of the CSR which covers a data point and the actual class label of that data point are the same, BP M OGA increments N OM by one. If no CSR within the CR covers a data point, the corresponding class label and a default class label are checked to see whether they are the same or different. The default class label is the class label of the largest number of uncovered data points in the training data set, where uncovered data points are those that do not satisfy the conditions of any CSR of the CR. In case of a match with the default class label, the algorithm increments N OM by one, while N OC remains unaltered. This step is repeated for all the data points of the training data set. After calculating the values of N OC and N OM , we use equations (7) and (8) to calculate T Cnf and T Cov respectively.

T Cnf = N OM / m    (7)

T Cov = N OC / m    (8)

where m = number of data points in the training data set. The N CSR value is derived by equation (9):

N CSR = Number of CSRs in the CR    (9)
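A minimal Java sketch of this evaluation, under the first-matching-CSR and default-class conventions described in section 3.2.7, is given below (the names and the functional interface used for coverage are illustrative assumptions, not the authors' code):

    import java.util.List;
    import java.util.function.BiPredicate;

    /** Evaluates a CR (an ordered list of CSRs) on the training data following equations (7)-(9). */
    public class CrEvaluator {
        /** covers.test(r, i) is true when the r-th CSR of the CR covers data point i;
         *  csrLabels.get(r) is the class predicted by that CSR; actual.get(i) is the true class of point i. */
        public static double[] evaluate(int numCsrs, int m,
                                        BiPredicate<Integer, Integer> covers,
                                        List<String> csrLabels, List<String> actual,
                                        String defaultClass) {
            int noc = 0, nom = 0;                       // NOC = Number of Cover, NOM = Number of Match
            for (int i = 0; i < m; i++) {
                String predicted = defaultClass;        // used when no CSR of the CR covers the point
                for (int r = 0; r < numCsrs; r++) {
                    if (covers.test(r, i)) {            // the first matching CSR decides the class
                        predicted = csrLabels.get(r);
                        noc++;
                        break;
                    }
                }
                if (predicted.equals(actual.get(i))) nom++;
            }
            double tCnf = (double) nom / m;             // equation (7)
            double tCov = (double) noc / m;             // equation (8)
            double nCsr = numCsrs;                      // equation (9)
            return new double[] {tCnf, tCov, nCsr};
        }
    }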

Now we have to simultaneously maximize T Cnf and T Cov and minimize N CSR. To convert this optimization problem into a maximization problem, we consider the maximization of 1/N CSR as the third objective of phase-II. Maximization of T Cnf , T Cov and 1/N CSR helps to find accurate, general and simple CRs respectively.

3.2.8. Selection of Pareto rules

BP M OGA selects the CRs lying on the Pareto front of max(T Cnf , T Cov, 1/N CSR) as Pareto chromosomes. The parent population of this method consists of the chromosomes selected as Pareto CRs in the previous generation, the post-crossover population, and the post-mutation population. This ensures propagation of elite chromosomes to the next generation of phase-II.

3.2.9. Generation of intermediate population

BP M OGA selects the top 20 CRs from the Pareto rules; if there are fewer than 20 Pareto rules, it selects all of them. All Pareto rules are equally significant if we consider all the objectives to be equally important. Assigning the highest importance to the accuracy of a CR, BP M OGA sorts all CRs in descending order of their T Cnf values. CRs are sorted on the basis of their T Cov values when their T Cnf values are equal. When CRs have equal T Cnf and T Cov values, they are sorted in ascending order with respect to their N CSR values. In case two or more CRs have equal T Cnf , T Cov and N CSR values, the CRs are arranged randomly. The justification for the intermediate population size of 20 is that only one CR is ultimately needed for classifying the test data set. From the second generation, for every odd generation, the population is made up of the chromosomes of the Pareto rules selected in earlier generations of phase-II. For every even generation, the population is formed by the procedure described in section 3.2.1. The proposed method preserves the elite chromosomes of the earlier generations of phase-II. After 50 iterations of phase-II, the Pareto-optimal CRs are given as input to phase-I.
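A compact sketch of this ordering, assuming hypothetical Cr objects that hold the three objective values, is given below; ties beyond N CSR are simply left in the input order, which corresponds to the random arrangement described above:

    import java.util.Comparator;
    import java.util.List;

    /** Orders CRs for the intermediate population: TCnf descending, then TCov descending, then NCSR ascending. */
    public class CrOrdering {
        public static class Cr {
            final double tCnf, tCov; final int nCsr;
            public Cr(double tCnf, double tCov, int nCsr) { this.tCnf = tCnf; this.tCov = tCov; this.nCsr = nCsr; }
        }

        public static final Comparator<Cr> BY_ACCURACY_THEN_COVERAGE_THEN_SIZE =
                Comparator.comparingDouble((Cr c) -> c.tCnf).reversed()
                          .thenComparing(Comparator.comparingDouble((Cr c) -> c.tCov).reversed())
                          .thenComparingInt(c -> c.nCsr);           // ties beyond nCsr keep their input order

        /** Keeps at most the top 20 CRs of the sorted Pareto rules. */
        public static List<Cr> intermediatePopulation(List<Cr> paretoCrs) {
            return paretoCrs.stream()
                            .sorted(BY_ACCURACY_THEN_COVERAGE_THEN_SIZE)
                            .limit(20)
                            .collect(java.util.stream.Collectors.toList());
        }
    }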


3.3. Fusion of phase-I and II

The process of combining phase-I and phase-II is depicted in Figure 13. The two phases of BP M OGA run in an interleaved fashion. The number of generations in each run is specified by the user; here we set 50 generations each for phase-I and phase-II, decided experimentally. As the proposed method is bi-phased and both phases of BP M OGA use M OGA, an evolution curve cannot be generated, which is a common way to show the effectiveness of a single-objective GA. The fusion of the two phases is performed using the following steps:

1. Combination
2. Chromosome filtering
3. Selection of Pareto CRs
4. CSRs extraction from CRs
5. Selection of Pareto CSRs

Now, we will explain these one by one.
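Before the individual steps are described, the overall interleaving of the two phases can be summarised by a short driver-loop sketch (the types and method names are illustrative placeholders, not the authors' code; rules are represented here simply as strings to keep the sketch self-contained):

    /** Driver loop of the bi-phased, cyclic structure: 20 runs of (50 generations of phase-I
     *  followed by 50 generations of phase-II), with the Pareto CRs of phase-II feeding the next run of phase-I. */
    public class BpMogaDriver {
        interface PhaseOne { java.util.List<String> run(java.util.List<String> seedCsrs, int generations); }
        interface PhaseTwo { java.util.List<String> run(java.util.List<String> csrs, int generations); }

        public static java.util.List<String> train(PhaseOne phaseOne, PhaseTwo phaseTwo) {
            java.util.List<String> csrs = new java.util.ArrayList<>();      // empty seed for the first run
            java.util.List<String> paretoCrs = new java.util.ArrayList<>();
            for (int run = 0; run < 20; run++) {
                csrs = phaseOne.run(csrs, 50);                               // Michigan-style CSR optimisation
                java.util.List<String> crs = phaseTwo.run(csrs, 50);         // Pittsburgh-style CR optimisation
                paretoCrs.addAll(crs);                                       // combination of elite CRs across runs
                csrs = extractCsrs(paretoCrs);                               // CSRs of surviving CRs seed the next run
            }
            return paretoCrs;                                                // one CR is finally chosen as the classifier
        }

        private static java.util.List<String> extractCsrs(java.util.List<String> crs) {
            return new java.util.ArrayList<>(crs);                           // placeholder: the real step walks the CR bit strings
        }
    }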


3.3.1. Combination

At this step, BP M OGA combines the chromosomes produced by the present and the previous run of phase-II. As shown in Figure 13, after the second run of phase-II the combination process combines four CRs: CR11 , CR12 , CR21 and CR22 . This step is required to preserve the elite CRs selected at different runs of phase-II.

3.3.2. Chromosome filtering

Two consecutive runs of phase-II may create multiple identical chromosomes. At this step BP M OGA selects unique chromosomes only. It may be noted that two such chromosomes may have the same binary representation despite being built from different CSRs. The nth run of phase-I may produce three CSRs, say CSRn1 , CSRn2 and CSRn3 , and the nth run of phase-II may then produce a CR with chromosome '111', denoted as CRn1 . Let the (n + 1)th run of phase-I produce three CSRs CSR(n+1)1 , CSR(n+1)2 and CSR(n+1)3 that are different from CSRn1 , CSRn2 and CSRn3 . Now the (n + 1)th run of phase-II may produce a CR with chromosome '111', denoted as CR(n+1)1 . Although the binary strings of CRn1 and CR(n+1)1 are the same, they are built from different CSRs. So CRn1 and CR(n+1)1 are not the same and BP M OGA selects both for the population. By this step BP M OGA reduces the number of extracted CRs.

3.3.3. Selection of Pareto CRs

BP M OGA selects the CRs lying on the front of max(T Cnf , T Cov, 1/N CSR) as Pareto CRs. To ensure preservation of good CRs of the previous run of phase-II, the Pareto CRs from two consecutive runs of phase-II are combined before selection. Two consecutive runs of phase-II produce two populations of Pareto-optimal CRs, and in the combined population one CR may be dominated by a CR of the other population. Therefore, by this step BP M OGA reduces


the population size after combination and simultaneously preserves the elite chromosomes.

3.3.4. CSRs extraction from CRs

BP M OGA extracts all unique CSRs from all Pareto CRs. The CSRs extracted through phase-I have a chance of taking part in CR formation in phase-II. In phase-II all CRs are modified through Bloat control (discussed in section 3.2.5). Now the CSRs which are not part of any CR are eliminated; with high probability, they are the last few CSRs of the sorted list of CSRs. In this way, the Pareto CRs extracted by phase-II guide the selection of CSRs for the next run of phase-I. Instead of selecting CSRs by crowding distance without rule interaction, we select CSRs considering rule interaction within CRs. This also justifies the cyclic nature of BP M OGA.

3.3.5. Selection of Pareto CSRs

Pareto CRs extracted by different runs of phase-II may contain the same CSRs, and some dominated CSRs may be present among the CSRs extracted by the previous step. So at this step BP M OGA selects the CSRs of every class lying on the Pareto front of max(Cnf , Cov, 1/N V F ). The CSRs of the same class are compared, and the selected CSRs work as input to the next run of phase-I. From the second run of phase-I onward, these CSRs build the initial population. This also justifies our claim that phase-I and phase-II are cyclically related. One interesting point to note is that we do not combine the Pareto CSRs of two consecutive runs of phase-I before the selection process. CSRs which are part of any Pareto CR are selected and the other CSRs are eliminated, which reduces the number of CSRs.

The process of selecting rules by BP M OGA is shown in Figure 14. Notice that phase-I selects all non-dominated CSRs of all classes lying on the Pareto-optimal front. In Figure 14, 15 CSRs from three classes lie on the Pareto-optimal fronts of the respective classes. The number of CSRs from every class need not be equal. Among the 15 CSRs, 11 CSRs are part of non-dominated CRs lying on the Pareto-optimal front; the other CSRs may be part of dominated CRs. So, at the end of phase-II, 11 CSRs are selected and those are the input to the next run of phase-I. Phase-I and phase-II run 20 times in a phased manner. As already mentioned, one run of phase-I or phase-II means 50 generations of phase-I or phase-II. So, 1000 (20 × 50) generations of phase-I and of phase-II are conducted.

3.4. Selection of CR for testing

The final run of phase-II produces some near Pareto-optimal CRs (i.e., CRs lying near to the Pareto front), out of which we have to select one CR. The choice is unique in the case of one CR at the Pareto front; otherwise the CR with the highest T Cnf is chosen. In case the choice is not unique, i.e. there exists more than one non-dominated CR with the highest T Cnf value, we select the CR with the highest T Cov among them. If more than one CR has the


highest T Cnf value and the same T Cov value, then we select the CR with the minimum N CSR value among them. In case we have more than one CR with the same T Cnf , T Cov and N CSR values, we pick one at random. While selecting the final CR we give the highest importance to accuracy, then to coverage and then to simplicity. For example, one CR for the student data set may be

If 98.95 ≤ Body temperature and Grade C or B then absent
else if Grade B or A then present

where the rule for the default class is not mentioned.

4. Testing and comparison


In this section, we compare the performance of BP M OGA with seven evolutionary crisp rule learning algorithms and seven non-evolutionary classifiers. The evolutionary crisp rule learning algorithm based classifiers encompass HIDER (HIerarchical DEcision Rules) (Aguilar-Ruiz et al., 2003, 2007), U CS (sUpervised Classifier System) (Bernadó-Mansilla and Garrell-Guiu, 2003), GAssist − ADI (GA based Classifier System - Adaptive Discretization Intervals) (Bacardit and Garrell, 2003, 2004), GAssist − Intervalar (GA based Classifier System with Intervalar Rules) (Bacardit and Garrell, 2007), CORE (COEvolutionary Rule Extractor) (Tan et al., 2006), ILGA (Incremental Learning with GAs) (Guan and Zhu, 2005) and BioHEL (Bioinformatics-oriented Hierarchical Evolutionary Learning) (Bacardit et al., 2009). Classifiers from the non-evolutionary categories, with various knowledge representation techniques, include M LP − BP (Multilayer Perceptron with Backpropagation) (Rojas, 2013), an artificial neural network based classifier; C4.5 rules (Quinlan, 2014), based on the decision tree algorithm C4.5; IBk (Aha et al., 1991), a nearest neighbor algorithm; N B (Naive Bayes) (Domingos and Pazzani, 1997), a probabilistic classifier using a Bayesian model; EF S − RP S (Evolutionary Feature Selection for fuzzy Rough set based Prototype Selection) (Derrac et al., 2013), a rough set based evolutionary classifier; CN N (Center Nearest Neighbor) (Gao and Wang, 2007), a center-based nearest neighbor classifier; and N SLV (New Structural Learning algorithm in Vague environment) (García et al., 2014), a fuzzy rule learning algorithm. All methods use KEEL (Knowledge Extraction based on Evolutionary Learning) (Alcalá-Fdez et al., 2008) and we have followed the parameter values recommended by KEEL, as mentioned in Tables 3 and 4 (Alcalá-Fdez et al., 2008). We did not adjust any parameter of any algorithm (including BP M OGA) to achieve better results; rather, we took the parameter values given in KEEL by the respective developers of the algorithms, to ensure fairness of comparison.

Table 3: Parameter values of crisp rule learning algorithms.

HIDER

GAssist − ADI

U CS


populationSize = 100 nGenerations = 100 mutationProbability = 0.5 crossPercent = 80 extremeMutationProbability = 0.05 pruneExamplesFactor = 0.05 penaltyFactor = 1 errorCoeficient = 0

numberOfExplores = 100000 popSize = 6400 delta = 0.1 nu = 10.0 acc 0 = 0.99 pX = 0.8 pM = 0.04 theta ga = 50.0 theta del = 50.0 theta sub = 500.0 doGASubsumption = true typeOfSelection = RWS tournamentSize = 0.4 typeOfMutation = free typeOfCrossover = 2PT r 0 = 0.6 m 0 = 0.1

GAssist − Intervalar hierarchicalSelectionThreshold = 0 iterationRuleDeletion = 5 iterationHierarchicalSelection = 24 ruleDeletionMinRules = 12 sizePenaltyMinRules = 4 numIterations = 500 initialNumberOfRules = 20 popSize = 400 probCrossover = 0.6 probMutationInd = 0.6 probOne = 0.90 tournamentSize = 3 numStrata = 2 discretizer1 = UniformWidth 4 discretizer2 = UniformWidth 5 discretizer3 = UniformWidth 6 discretizer4 = UniformWidth 7 discretizer5 = UniformWidth 8 discretizer6 = UniformWidth 10 discretizer7 = UniformWidth 15 discretizer8 = UniformWidth 20 discretizer9 = UniformWidth 25 discretizer10 = Disabled maxIntervals = 5 probSplit = 0.05 probMerge = 0.05 probReinitializeBegin = 0.03 probReinitializeEnd = 0 adiKR = false useMDL = true iterationMDL = 25

CORE popSize = 100 CopopulationSize = 50 generationLimit = 100 numberOfCopopulations = 15 CrossoverRate = 1.0 ProbMutation = 0.1 RegenerationProbability = 0.5


hierarchicalSelectionThreshold = 0 iterationRuleDeletion = 5 iterationHierarchicalSelection = 24 ruleDeletionMinRules = 12 sizePenaltyMinRules = 4 numIterations = 500 initialNumberOfRules = 20 popSize = 400 probCrossover = 0.6 probMutationInd = 0.6 probOne = 0.90 tournamentSize = 3 numStrata = 2 discretizer1 = UniformWidth 4 discretizer2 = UniformWidth 5 discretizer3 = UniformWidth 6 discretizer4 = UniformWidth 7 discretizer5 = UniformWidth 8 discretizer6 = UniformWidth 10 discretizer7 = UniformWidth 15 discretizer8 = UniformWidth 20 discretizer9 = UniformWidth 25 discretizer10 = Disabled maxIntervals = 5 probSplit = 0.05 probMerge = 0.05 probReinitializeBegin = 0.03 probReinitializeEnd = 0 adiKR = true useMDL = true iterationMDL = 25 initialTheoryLengthRatio = 0.075 weightRelaxFactor = 0.90 defaultClass = auto initMethod = cwInit ILGA ProbMutation = 0.01 CrossoverRate = 1.0 popSize = 200 ruleNumber = 30 stagnationLimit = 30 generationLimit = 200 SurvivorsPercent = 0.5 mutationRedRate = 0.5 crossoverRedRate = 0.5 AttributeOrder = descendent incrementalStrategy = 1

initialTheoryLengthRatio = 0.075 weightRelaxFactor = 0.90 defaultClass = auto initMethod = cwInit BioHEL defClass = disabled popSize = 500 selectAlg = tournamentWOR tournamentSize = 4 crossoverProbability = 0.6 mutationProbability = 0.6 elitismEnabled = true numIterations = 100 numRepetitionsLearning = 2 generalizeProbability = 0.1 specializeProbability = 0.1 winMethod = ilas numStrataWindowing = 2 numStages = 2 optMethod = minimize smartInitMethod = true classWiseInit = true expectedRuleSize = 15 probOne = 0.75 useMDL = true numIterationsMDL = 10 initialTheoryLengthRatio = 0.01 mdlWeightRelaxFactor = 0.9 coverageBreakpoint = 0.1 coverageRatio = 0.9


BP M OGA
IPSizePhaseI = 10% of Training Data Set Size or 20, whichever is larger
CrossoverProbabilityPhaseI = 1.0
typeOfCrossoverPhaseI = 1PT
mutationProbabilityPhaseI = 0.1
numIterationsPhaseI = 50
IPSizePhaseII = 20
CrossoverProbabilityPhaseII = 1.0
typeOfCrossoverPhaseII = 1PT
mutationProbabilityPhaseII = 0.1
classRuleProbability = 0.5
numIterationsPhaseII = 50
NumberOfRunOfBP M OGA = 20

We have used 21 benchmark data sets from the U CI machine learning repository (Asuncion and Newman, 2007) for testing the performance of the algorithms. Table 5 describes the characteristics of the data sets, which contain continuous and categorical features; a data set may have missing feature values and more than 2 classes (treated as a multi-class problem). For experimentation, the 10-fold Cross Validation (10-CV ) (Weiss and Kulikowski, 1991) technique has been applied to evaluate the performance of each algorithm. Among the 15 algorithms some are deterministic and some have a few random components. Independent runs of each fold of a deterministic algorithm converge to the same solution, while the non-deterministic algorithms also achieve the same outcome in most cases. We evaluate the performance of each classifier 10 times for each data set. For that we divide the data sets into 10 parts randomly; every part has 9/10 of the whole data set as training data and the remaining 1/10 as testing data. Each algorithm is tested and the average accuracy is calculated for the training and testing data sets of a single run of 10-CV . For each data set and for each algorithm we repeat the whole process 10 times. We do not evaluate the test data (in between phase-I and phase-II of BP M OGA) until all 20 cycles of the two phases are over. At the end one CR is chosen by the procedure described in section 3.4 and that CR is used to calculate the testing accuracy on the test data sets. Average accuracies, standard deviations and ranks are tabulated. Table 6 tabulates the performance of the proposed BP M OGA with 7 other evolutionary crisp rule learning techniques on the test data sets. Table 7 tabulates the results of the proposed BP M OGA and 7 non-evolutionary approaches using the same test data sets.


Table 4: Parameter values of other learning algorithms. M LP − BP Hidden layers = 2 Hidden nodes = 15 Transfer = Htan Eta = 0.15 Alpha = 0.10 Lambda = 0.0 Test data = true Validation data = false Cross validation = false Cycles = 10000 Improve = 0.01 Problem = Classification Tipify inputs = true Verbose = false SaveAll = false NB Bayesian Discretizer

C4.5 rules confidence = 0.25 itemsetsPerLeaf = 2 threshold = 10

IB3 Confidence Level for Acceptance = 0.9 Confidence Level for Dropping = 0.7 Distance Function = Euclidean

EF S − RP S K=1 SizePop = 50 MaxEvaluations = 10000 Beta = 0.99 MutProb = 0.005 Cycle = 100 Implicator = LUKASIEWICZ tnorm = LUKASIEWICZ

CN N Nil

N SLV Population Size = 100 Number of Iterations Allowed without Change = 500 Binary Mutation Probability = 0.01 Integer Mutation Probability = 0.01 Real Mutation Probability = 1.0 Crossover Probability = 1.0


Table 5: Characteristics of the data sets used in this paper (Here, # indicates number of).

id    Name            #Records  #features  #Continuous features  #Categorical features  #Classes  Have missing values
ato   Automobile         205       26              15                    11                 6              Y
crx   Credit             653       16               6                    10                 2              N
der   Dermatology        366       35              34                     1                 6              Y
d-p   Diabates(Pima)     768        9               8                     1                 2              N
eco   Ecoli              336        8               7                     1                 8              N
fla   Flare             1066       12               0                    12                 6              N
gls   Glass              214       10               9                     1                 6              N
h-c   Heart(Cleve)       303       14              10                     4                 7              Y
hab   Haberman           306        4               3                     1                 2              N
irs   Iris               150        5               4                     1                 3              N
lbr   Labor               57       17               8                     9                 2              Y
led   Led7digit          500        8               7                     1                10              N
mnk   Monk               432        7               6                     1                 2              N
n-t   New Thyroid        215        6               5                     1                 3              N
son   Sonar              208       61              60                     1                 2              N
vle   Vehicle            846       19              18                     1                 4              N
vwl   Vowel              990       14              10                     4                11              N
wne   Wine               178       14              13                     1                 3              N
wis   Wisconsin          683       10               9                     1                 2              Y
yes   Yeast             1484        9               8                     1                10              N
zoo   Zoo                101       17               0                    17                 7              N

Results obtained using the training data sets are shown in Tables 8 and 9 for the sake of completeness; they also show that the proposed classifier does not suffer from over-fitting. In Table 6, only U CS shows a higher average accuracy than BP M OGA, mainly due to the higher accuracy on the vowel data set, but the average rank of U CS is higher than that of BP M OGA (a lower value means better). For the 8 evolutionary based and 1 non-evolutionary based algorithms, the number of CSRs with standard deviation is tabulated in Table 10. The number of rules used by U CS is much higher than that of BP M OGA. In Table 7, EF S − RP S and N SLV offer higher average accuracies than BP M OGA but their average ranks are higher than that of BP M OGA. We have developed BP M OGA in Java, while the other algorithms are developed and run in KEEL. The run times of each algorithm for each data set (10 runs of 10-CV ) are tabulated in Table 11. We have run it on an Intel(R) Core(TM) i5-3230M CPU 2.60GHz processor with 4GB RAM under the Windows 7 operating system. The non-evolutionary algorithms take less time to run, except EF S − RP S and N SLV . The run times of BP M OGA are comparable with the run times of the evolutionary algorithms, but no conclusive decision can be taken as they run in different set-ups. In future we will try to implement BP M OGA in KEEL. In the rule based classifier extracted by BP M OGA, the decision boundaries separating the classes are always axis-parallel, like those of C4.5 rules (Quinlan, 2014). Still, it shows better classification accuracies as compared with the six other classifiers compared here which are capable of producing non-axis-parallel classification boundaries. All evolutionary learners, including BP M OGA, use the default rule at the end to classify unclassified instances, so the coverage is always 100%. We have done various relevant statistical tests to infer the comparability of the performance of the proposed BP M OGA technique and its counterparts.


61.98 (±2.30)(6) 81.17 (±0.95)(8) 87.73 (±1.93)(6) 73.55 (±1.26)(4) 77.16 (±1.13)(5) 74.09 (±0.56)(5) 65.88 (±2.23)(3) 53.78 (±1.83)(4) 74.09 (±1.19)(2) 94.00 (±1.22)(5) 69.26 (±0.80)(4) 97.22 (±0.00)(5) 91.52 (±1.12)(5) 50.87 (±1.11)(8) 62.98 (±1.54)(6) 64.90 (±1.53)(3) 81.58 (±2.82)(8) 96.53 (±0.40)(1) 56.51 (±0.83)(3) 94.43 (±2.01)(5) 75.46 (±1.34)(4.80)

ato

Ave.

zoo

yes

wis

wne

vwl

vle

son

n-t

mnk

led

lbr

irs

hab

h-c

gls

fla

eco

d-p

der

crx

HIDER

id

67.95 (±2.55)(3) 85.00 (±0.86)(5) 95.19 (±0.84)(1) 75.37 (±0.59)(1) 78.71 (±1.48)(1) 72.01 (±0.91)(6) 65.06 (±2.30)(4) 53.70 (±1.22)(5) 72.41 (±0.93)(5) 93.73 (±2.06)(6) 82.70 (±3.95)(5) 69.46 (±1.11)(3) 98.47 (±0.55)(4) 93.78 (±0.11)(1) 74.73 (±2.74)(3) 72.23 (±0.88)(1) 84.70 (±1.03)(1) 92.11 (±1.47)(5) 96.32 (±0.29)(2) 56.13 (±1.16)(4) 95.45 (±1.23)(1) 79.77 (±1.39)(3.19)

U CS

GAssist− ADI 64.34 (±2.00)(5) 86.07 (±0.59)(1) 94.58 (±0.90)(4) 74.19 (±0.67)(2) 77.83 (±1.57)(3) 74.10 (±0.41)(3.5) 62.86 (±1.07)(6) 55.88 (±1.21)(2) 72.64 (±0.94)(3) 96.53 (±0.76)(1) 96.23 (±1.26)(1) 60.32 (±2.10)(6) 99.14 (±0.59)(3) 91.20 (±1.56)(7) 75.62 (±2.29)(1) 66.58 (±0.98)(4) 39.60 (±1.40)(5.5) 93.46 (±1.57)(4) 95.91 (±0.70)(4) 52.38 (±0.93)(5) 92.30 (±1.52)(6.5) 77.23 (±1.19)(3.69)

GAssist− Intervalar 64.91 (±2.66)(4) 85.70 (±0.77)(3) 94.21 (±1.25)(5) 73.50 (±1.15)(5) 77.55 (±1.96)(4) 74.10 (±0.41)(3.5) 63.38 (±1.73)(5) 57.24 (±1.69)(1) 72.03 (±1.36)(7) 95.80 (±1.54)(2) 95.73 (±1.40)(2) 65.30 (±1.66)(5) 99.42 (±0.47)(2) 91.60 (±0.70)(4) 72.07 (±1.29)(5) 63.83 (±1.05)(5) 39.60 (±1.40)(5.5) 91.64 (±1.96)(6) 94.72 (±0.38)(5) 49.75 (±1.74)(6) 92.30 (±1.52)(6.5) 76.88 (±1.34)(4.36) 26.59 (±0.75)(8) 82.59 (±1.44)(7) 43.45 (±0.01)(8) 72.31 (±0.74)(7) 64.75 (±0.69)(7) 66.17 (±0.84)(7) 52.09 (±1.19)(7) 52.57 (±1.65)(7) 72.58 (±1.11)(4) 93.46 (±0.93)(8) 87.40 (±5.72)(3) 24.92 (±1.01)(7) 93.10 (±0.77)(6) 91.20 (±1.41)(6) 53.38 (±0.00)(7) 37.61 (±1.22)(8) 21.22 (±0.83)(8) 94.05 (±0.91)(2) 94.07 (±0.48)(7) 38.73 (±0.68)(8) 94.54 (±1.41)(4) 64.61 (±1.13)(6.48)

CORE 56.27 (±1.70)(7) 85.86 (±0.52)(2) 60.16 (±2.42)(7) 73.10 (±0.74)(6) 63.48 (±2.17)(8) 62.30 (±1.28)(8) 48.14 (±2.88)(8) 52.34 (±1.10)(8) 72.31 (±0.69)(6) 93.60 (±1.30)(7) 80.53 (±4.23)(7) 00.00 (±0.00)(8) 54.13 (±1.02)(8) 90.72 (±1.11)(8) 71.06 (±1.83)(6) 58.39 (±1.26)(7) 36.94 (±1.82)(7) 88.00 (±2.73)(7) 90.77 (±0.49)(8) 41.12 (±1.39)(7) 85.65 (±1.95)(8) 64.99 (±1.56)(7.05)

ILGA

BP M OGA 73.83 (±2.64)(2) 85.64 (±0.49)(4) 95.16 (±0.88)(3) 74.14 (±0.61)(3) 78.50 (±1.57)(2) 74.89 (±0.47)(1) 70.04 (±1.56)(2) 54.69 (±0.92)(3) 74.48 (±1.02)(1) 95.73 (±0.83)(3) 83.03 (±5.46)(4) 70.12 (±0.77)(2) 89.29 (±7.00)(7) 93.72 (±0.84)(2) 72.32 (±1.34)(4) 68.12 (±0.75)(3) 61.33 (±0.90)(4) 94.23 (±0.89)(1) 96.08 (±0.44)(3) 58.51 (±0.70)(1) 95.32 (±0.94)(2) 79.01 (±1.48)(2.71)

BioHEL 75.54 (±1.98)(1) 83.26 (±1.03)(6) 95.17 (±0.97)(2) 70.48 (±1.37)(8) 77.07 (±0.77)(6) 74.50 (±0.47)(2) 70.11 (±1.91)(1) 52.97 (±2.06)(6) 63.59 (±2.35)(8) 94.00 (±0.31)(4) 80.73 (±4.68)(6) 70.30 (±0.98)(1) 100.00 (±0.00)(1) 93.50 (±1.42)(3) 75.56 (±2.56)(2) 71.28 (±0.75)(2) 72.93 (±1.53)(2) 93.50 (±0.97)(3) 94.67 (±0.52)(6) 58.44 (±0.62)(2)) 95.03 (±0.96)(3) 79.17 (±1.34)(3.57)

Table 6: Comparison of the performance of BP M OGA and the performance of the evolutionary crisp rule learners on the test data set. Average accuracy, s.d. and rank are tabulated in the columns.


Ave.

zoo

yes

wis

wne

vwl

vle

son

n-t

mnk

led

lbr

irs

hab

h-c

gls

fla

eco

d-p

der

crx

ato

id

M LP − BP 17.72 (±6.05)(7) 77.72 (±4.81)(7) 61.13 (±3.66)(8) 70.21 (±0.74)(5) 36.77 (±2.22)(8) 46.62 (±4.41)(7) 34.39 (±3.02)(8) 49.80 (±2.22)(7) 65.55 (±6.01)(5) 68.60 (±5.68)(8) 86.93 (±2.77)(3) 22.00 (±3.72)(7) 73.95 (±2.37)(7) 81.28 (±2.26)(8) 65.66 (±2.60)(8) 37.43 (±1.26)(8) 15.30 (±0.80)(6) 92.54 (±2.14)(7) 96.41 (±0.31)(2) 32.14 (±2.46)(8) 51.47 (±8.44)(8) 56.36 (±3.24)(6.86)

C4.5 rules 72.57 (±2.24)(6) 85.45 (±0.86)(3) 95.98 (±0.75)(1) 72.89 (±0.79)(3) 77.89 (±1.04)(3) 69.50 (±0.64)(4) 66.64 (±1.76)(4) 53.56 (±2.03)(2) 95.93 (±0.66)(2) 75.80 (±3.35)(8) 71.12 (±0.73)(2) 100.00 (±0.00)(1.5) 93.16 (±0.98)(5) 74.29 (±2.21)(5) 65.49 (±0.74)(5) 68.30 (±1.46)(3) 93.64 (±0.61)(6) 95.73 (±0.38)(4) 55.57 (±1.05)(2) 92.87 (±0.87)(5) 78.82 (±1.16)(3.83)

15.86 (±1.74)(8) 64.31 (±2.65)(8) 92.07 (±0.88)(7) 64.88 (±1.20)(8) 70.32 (±2.30)(6) 31.96 (±1.44)(8) 65.91 (±2.91)(5) 48.75 (±1.29)(8) 61.85 (±2.43)(7) 92.27 (±2.52)(7) 77.50 (±4.93)(7) 36.40 (±1.79)(6) 81.22 (±1.84)(5) 93.59 (±0.96)(4) 80.85 (±2.04)(3) 65.38 (±1.59)(6) 95.48 (±0.76)(2) 92.33 (±1.79)(8) 94.73 (±0.81)(6) 45.92 (±0.85)(7) 85.77 (±3.80)(7) 69.40 (±1.97)(6.48)

IB3

75.99 (±1.37)(2) 83.57 (±0.42)(4) 94.53 (±0.65)(5) 71.46 (±0.97)(4) 74.55 (±0.34)(5) 77.21 (±0.93)(1) 63.05 (±1.87)(7) 53.15 (±0.71)(4) 71.11 (±0.62)(3) 93.40 (±0.73)(5) 92.07 (±1.22)(1) 11.00 (±0.00)(8) 52.78 (±0.00)(8) 94.87 (±0.22)(2) 68.13 (±1.23)(7) 62.56 (±0.62)(7) 27.60 (±0.56)(5) 97.04 (±0.66)(1) 96.94 (±0.14)(1) 54.01 (±0.40)(3) 94.15 (±0.87)(4) 71.87 (±0.69)(4.24)

NB

EF S− RP S 79.65 (±1.18)(1) 81.74 (±1.18)(5) 95.19 (±1.11)(3) 66.20 (±1.24)(7) 77.67 (±0.57)(4) 71.11 (±1.58)(3) 74.55 (±1.14)(1) 51.69 (±2.63)(5) 66.33 (±3.78)(4) 95.67 (±0.72)(4) 83.13 (±5.31)(5) 62.74 (±3.11)(5) 100.00 (±0.00)(1.5) 96.98 (±0.06)(1) 85.82 (±2.03)(2) 66.25 (±1.38)(2) 98.79 (±0.27)(7) 95.16 (±0.81)(6) 95.24 (±0.41)(7) 51.66 (±0.74)(4) 94.21 (±1.40)(3) 80.47 (±1.46)(3.5) 75.56 (±1.26)(3) 80.98 (±0.49)(6) 95.96 (±0.36)(2) 69.49 (±0.85)(6) 67.29 (±0.92)(7) 65.13 (±0.35)(6) 69.79 (±1.11)(3) 51.66 (±1.17)(6) 64.84 (±1.27)(6) 92.87 (±0.63)(6) 88.33 (±7.22)(2) 72.34 (±0.57)(1) 74.33 (±1.43)(6) 87.14 (±0.90)(7) 88.53 (±0.75)(1) 72.85 (±0.65)(1) 98.99 (±0.16)(1) 96.63 (±0.88)(5) 95.29 (±0.27)(5) 47.11 (±0.52)(6) 92.05 (±0.05)(6) 78.44 (±1.04)(4.24)

CN N

73.98 (±2.16)(4) 85.81 (±0.66)(1) 94.35 (±0.63)(6) 74.82 (±0.68)(1) 81.32 (±1.04)(1) 67.20 (±0.79)(5) 64.90 (±2.47)(6) 53.19 (±1.65)(3) 73.59 (±0.07)(2) 96.67 (±0.63)(1) 84.17 (±3.33)(4) 63.36 (±1.03)(4) 98.10 (±0.45)(3) 92.57 (±0.41)(6) 76.22 (±0.86)(4) 66.34 (±1.26)(4) 84.30 (±1.60)(2) 94.00 (±1.04)(9) 94.98 (±0.73)(2) 47.12 (±1.39)(5) 94.61 (±1.25)(2) 79.12 (±1.25)(3.66)

N SLV

73.83 (±2.64)(5) 85.64 (±0.49)(2) 95.16 (±0.88)(4) 74.14 (±0.61)(2) 78.50 (±1.57)(2) 74.89 (±0.47)(2) 70.04 (±1.56)(2) 54.69 (±0.92)(1) 74.48 (±1.02)(1) 95.73 (±0.83)(2) 83.03 (±5.46)(6) 70.12 (±0.77)(3) 89.29 (±7.00)(4) 93.72 (±0.84)(3) 72.32 (±1.34)(6) 68.12 (±0.75)(6) 61.33 (±0.90)(4) 94.23 (±0.89)(3) 96.08 (±0.44)(3) 58.51 (±0.70)(1) 95.32 (±0.94)(1) 79.01 (±1.48)(3.00)

BP M OGA

Table 7: Comparison of the performance of BP M OGA and the performance of the other learners on the test data set. Average accuracy, s.d. and rank are tabulated in the columns.


96.00 (±0.29)(3) 87.51 (±0.53)(7) 94.81 (±0.41)(6) 78.02 (±0.32)(5) 88.27 (±0.19)(4) 80.54 (±0.08)(2) 88.75 (±0.72)(2) 82.17 (±0.83)(3) 77.26 (±0.03)(6) 97.42 (±0.09)(5) 77.70 (±0.14)(2) 97.22 (±0.00)(5) 96.39 (±0.20)(5) 98.23 (±0.16)(3) 85.22 (±0.39)(3) 92.98 (±0.29)(2) 97.43 (±0.28)(7) 97.29 (±0.03)(6) 70.81 (±0.51)(3) 99.37 (±0.26)(5) 89.17 (±0.30)(4.20)

ato

Ave.

zoo

yes

wis

wne

vwl

vle

son

n-t

mnk

led

lbr

irs

hab

h-c

gls

fla

eco

d-p

der

crx

HIDER

id

96.86 (±0.51)(2) 95.35 (±0.42)(2) 99.80 (±0.07)(1) 84.53 (±1.17)(2) 91.03 (±4.63)(2) 82.48 (±0.39)(1) 85.21 (±1.70)(4) 96.42 (±0.65)(1) 80.45 (±0.99)(4) 94.93 (±1.25)(8) 99.32 (±0.26)(4) 78.84 (±0.29)(1) 98.75 (±0.51)(4) 96.20 (±0.94)(6) 99.53 (±0.27)(2) 98.14 (±0.23)(1) 99.63 (±0.10)(1) 97.40 (±1.10)(8) 98.87 (±0.17)(3) 80.80 (±0.54)(1) 100.00 (±0.00)(1) 93.07 (±0.77)(2.81)

UCS

GAssistADI 76.82 (±0.81)(6) 90.52 (±0.22)(3) 96.51 (±0.50)(5) 81.66 (±0.21)(3) 81.92 (±0.35)(6) 76.70 (±0.35)(5.5) 72.51 (±0.80)(5) 68.96 (±0.53)(6) 81.23 (±0.43)(3) 98.21 (±0.15)(2) 100.00 (±0.00)(2.5) 63.44 (±0.68)(6) 99.42 (±0.21)(3) 97.75 (±0.25)(3) 95.36 (±0.39)(5) 70.76 (±0.34)(5) 43.13 (±0.83)(5.5) 99.52 (±0.10)(5) 98.77 (±0.14)(1) 54.09 (±0.72)(5) 97.25 (±0.52)(6.5) 83.07 (±0.41)(4.33)

GAssistIntervalar 77.54 (±0.76)(5) 90.06 (±0.25)(4) 98.49 (±0.15)(4) 81.39 (±0.29)(4) 82.63 (±0.36)(5) 76.70 (±0.35)(5.5) 71.80 (±0.53)(6) 70.91 (±0.56)(4) 81.87 (±0.35)(2) 98.04 (±0.15)(3) 100.00 (±0.00)(2.5) 70.03 (±1.03)(5) 99.51 (±0.33)(2) 97.52 (±0.29)(4) 96.91 (±0.48)(4) 68.17 (±0.80)(6) 43.13 (±0.83)(5.5) 99.14 (±0.21)(3) 98.81 (±0.10)(4) 52.27 (±0.77)(6) 97.25 (±0.52)(6.5) 83.44 (±0.43)(4.29) 27.48 (±0.37)(8) 82.62 (±1.18)(8) 43.44 (±0.00)(8) 73.52 (±0.29)(8) 66.71 (±0.72)(8) 66.60 (±0.53)(7) 56.08 (±0.64)(7) 56.27 (±0.16)(8) 75.80 (±0.15)(7) 96.03 (±0.24)(7) 91.46 (±3.94)(7) 24.98 (±0.97)(7) 93.56 (±0.60)(6) 93.46 (±0.35)(8) 53.36 (±0.00)(8) 38.38 (±0.85)(8) 21.88 (±1.06)(8) 99.13 (±0.19)(4) 94.64 (±0.34)(7) 39.21 (±0.71)(8) 99.39 (±0.48)(4) 66.41 (±0.66)(7.19)

CORE 64.47 (±1.60)(7) 88.05 (±0.16)(6) 66.23 (±2.45)(7) 76.63 (±0.24)(7) 67.23 (±2.07)(7) 63.38 (±1.27)(8) 56.53 (±2.57)(8) 58.79 (±0.41)(7) 74.54 (±0.30)(8) 97.34 (±0.22)(6) 98.87 (±0.72)(5) 0.00 (±0.00)(8) 58.76 (±0.18)(8) 95.73 (±0.46)(7) 85.42 (±0.77)(7) 62.35 (±1.22)(7) 40.10 (±2.22)(7) 98.61 (±0.40)(6) 94.29 (±0.12)(8) 41.89 (±1.42)(7) 94.82 (±1.08)(8) 70.71 (±0.95)(7.09)

ILGA

BPMOGA 86.79 (±0.81)(4) 88.90 (±0.17)(5) 98.78 (±0.26)(3) 77.72 (±0.23)(6) 88.65 (±0.21)(3) 78.19 (±0.21)(4) 86.14 (±0.54)(3) 69.97 (±0.48)(5) 80.25 (±0.30)(5) 97.45 (±0.66)(4) 96.76 (±0.50)(6) 76.06 (±2.95)(4) 90.34 (±7.34)(7) 98.61 (±0.11)(2) 89.31 (±0.48)(6) 73.47 (±0.18)(4) 71.66 (±0.48)(4) 99.95 (±0.04)(2) 97.53 (±0.34)(5) 64.34 (±0.26)(4) 99.50 (±0.21)(3) 86.21 (±0.80)(4.24)

BioHEL 99.74 (±0.10)(1) 98.36 (±0.13)(1) 99.39 (±0.07)(2) 96.19 (±0.17)(1) 95.16 (±0.50)(1) 78.50 (±0.28)(3) 98.86 (±0.17)(1) 82.51 (±0.86)(2) 89.84 (±0.65)(1) 100.00 (±0.00)(1) 100.00 (±0.00)(1) 76.64 (±0.25)(3) 100.00 (±0.00)(1) 99.96 (±0.04)(1) 100.00 (±0.00)(1) 92.50 (±0.10)(1) 88.47 (±5.55)(3) 100.00 (±0.00)(1) 99.23 (±0.21)(2) 73.59 (±0.32)(2) 100.00 (±0.00)(1) 93.76 (±0.45)(1.52)

Table 8: Comparison of the performance of BP M OGA and the performance of the evolutionary crisp rule learners on the training data set. Average accuracy, s.d. and rank are tabulated in the columns.


Ave.

zoo

yes

wis

wne

vwl

vle

son

n-t

mnk

led

lbr

irs

hab

h-c

gls

fla

eco

d-p

der

crx

ato

id

MLPBP 19.13 (±6.05)(8) 78.24 (±5.07)(7) 61.64 (±3.56)(8) 71.31 (±0.72)(7) 47.74 (±2.99)(8) 36.93 (±2.29)(8) 36.64 (±1.35)(8) 52.80 (±1.34)(6) 67.24 (±5.23)(5) 69.74 (±4.71)(8) 99.84 (±0.24)(1) 21.57 (±3.04)(7) 74.48 (±2.63)(7) 83.46 (±1.82)(8) 69.72 (±1.71)(8) 37.84 (±1.10)(8) 16.10 (±0.65)(8) 92.73 (±1.60)(7) 97.19 (±0.13)(4) 32.43 (±1.82)(8) 54.08 (±8.21)(8) 58.14 (±2.68)(7.00)

C4.5 rules 83.44 (±0.93)(5) 87.74 (±0.23)(4) 98.16 (±0.09)(4) 77.85 (±0.60)(3) 87.17 (±0.26)(3) 73.26 (±0.42)(4) 83.30 (±1.20)(3) 64.08 (±0.56)(4) 97.42 (±0.05)(3) 89.27 (±1.12)(7) 75.94 (±0.17)(2) 100.00 (±0.00)(1.5) 97.12 (±0.22)(3) 93.19 (±0.81)(4) 72.36 (±0.51)(6) 82.53 (±0.92)(5) 98.60 (±0.07)(5) 97.82 (±0.07)(2) 62.09 (±0.51)(3) 98.67 (±0.06)(4) 86.00 (±0.44)(3.76)

55.11 (±0.66)(7) 75.33 (±0.63)(8) 89.71 (±0.59)(7) 61.96 (±0.92)(8) 64.18 (±2.11)(7) 43.08 (±1.06)(7) 58.35 (±0.62)(7) 43.15 (±0.71)(8) 58.60 (±0.99)(7) 89.89 (±0.98)(7) 73.16 (±9.04)(8) 35.47 (±1.32)(6) 74.60 (±0.75)(6) 91.35 (±0.84)(6) 76.40 (±0.89)(7) 57.71 (±0.31)(7) 80.34 (±0.17)(6) 89.55 (±1.10)(8) 94.06 (±0.66)(8) 40.09 (±0.23)(7) 80.67 (±1.14)(7) 68.23 (±1.22)(7.09)

IB3

93.84 (±0.17)(2) 92.40 (±0.05)(1) 96.55 (±0.12)(5) 91.88 (±0.20)(1) 88.93 (±0.15)(1) 76.24 (±0.05)(2) 98.29 (±0.21)(1) 73.04 (±0.32)(2) 78.59 (±0.08)(2) 95.73 (±0.15)(5) 96.55 (±0.13)(5) 11.44 (±0.00)(8) 52.78 (±0.00)(8) 97.71 (±0.05)(2) 100.00 (±0.00)(1) 80.07 (±0.16)(2) 100.00 (±0.00)(1) 100.00 (±0.00)(1) 97.18 (±0.06)(5) 67.68 (±0.12)(1) 99.65 (±0.11)(2) 85.17 (±0.10)(2.76)

NB

EFSRPS 85.82 (±0.28)(4) 85.48 (±0.16)(5) 98.28 (±0.13)(3) 71.45 (±0.44)(6) 78.94 (±0.11)(5) 72.17 (±0.91)(5) 78.86 (±0.25)(5) 59.19 (±0.30)(5) 68.05 (±3.37)(4) 96.36 (±0.32)(4) 98.01 (±2.00)(3) 63.12 (±3.20)(5) 100.00 (±0.00)(1.5) 94.57 (±0.79)(5) 95.50 (±0.21)(3) 77.83 (±0.33)(3) 99.21 (±0.02)(2) 98.84 (±0.10)(4) 96.80 (±0.17)(6) 52.76 (±0.08)(4) 98.23 (±0.14)(5) 84.26 (±0.64)(4.17) 75.26 (±0.24)(6) 81.01 (±0.06)(6) 96.31 (±0.05)(6) 71.46 (±0.97)(5) 70.53 (±0.22)(6) 65.08 (±0.06)(6) 70.15 (±0.25)(6) 52.34 (±0.16)(7) 65.87 (±0.33)(6) 92.78 (±0.19)(6) 89.91 (±6.11)(6) 72.99 (±0.20)(3) 74.68 (±0.61)(5) 88.32 (±0.21)(7) 89.84 (±0.30)(5) 73.14 (±0.08)(5) 98.91 (±0.03)(3) 97.22 (±0.09)(6) 95.29 (±0.04)(7) 47.52 (±0.17)(6) 97.47 (±0.00)(6) 79.34 (±0.49)(5.67)

CNN

BPMOGA

86.79 (±0.81)(3) 88.90 (±0.17)(3) 98.78 (±0.26)(2) 77.72 (±0.23)(4) 88.65 (±0.21)(2) 78.19 (±0.21)(1) 86.14 (±0.54)(2) 69.97 (±0.48)(3) 80.25 (±0.30)(1) 97.45 (±0.66)(2) 96.76 (±0.50)(4) 76.06 (±2.95)(1) 90.34 (±7.34)(4) 98.61 (±0.11)(1) 89.31 (±0.48)(6) 73.47 (±0.18)(4) 71.66 (±0.48)(7) 99.95 (±0.04)(2) 97.53 (±0.34)(3) 64.34 (±0.26)(2) 99.50 (±0.21)(3) 86.21 (±0.80)(2.86)

NSLV

97.76 (±0.43)(1) 90.21 (±0.50)(2) 99.20 (±0.19)(1) 80.29 (±0.49)(2) 86.88 (±0.73)(4) 73.96 (±0.94)(3) 79.71 (±0.52)(4) 91.47 (±0.90)(1) 78.31 (±0.30)(3) 97.69 (±0.41)(1) 99.40 (±0.48)(2) 68.40 (±0.93)(4) 97.93 (±0.35)(3) 94.75 (±0.40)(4) 97.99 (±0.43)(2) 84.16 (±0.96)(1) 94.57 (±1.01)(4) 99.61 (±0.26)(3) 98.89 (±0.13)(1) 50.37 (±1.24)(5) 100.00 (±0.00)(1) 88.64 (±0.55)(2.48)

Table 9: Comparison of the performance of BP M OGA and the performance of the other learners on the training data set. Average accuracy, s.d. and rank are tabulated in the columns.


Ave.

zoo

yes

wis

wne

vwl

vle

son

n-t

mnk

led

lbr

irs

hab

h-c

gls

fla

eco

d-p

der

crx

id id ato

U CS

5747.66 (±33.40) 4517.33 (±74.40) 4025.00 (±118.92) 4237.37 (±58.76) 3403.85 (±67.00) 4952.20 (±21.44) 3718.91 (±73.82) 5028.52 (±24.33) 1958.16 (±62.66) 1199.52 (±182.25) 1931.70 (±261.23) 4229.69 (±23.37) 645.42 (±86.81) 1312.15 (±104.41) 5941.15 (±48.98) 5767.45 (±17.37) 5647.25 (±18.15) 3064.36 (±338.98) 1780.63 (±75.91) 4773.51 (±19.35) 1518.52 (±59.90) 3590.49 (±84.35)

HIDER

42.31 (±0.96) 17.17 (±0.36) 9.33 (±0.32) 9.96 (±0.41) 11.38 (±0.28) 26.39 (±1.00) 21.42 (±0.60) 21.94 (±1.09) 2.44 (±0.71) 3.00 (±0.00) 18.55 (±0.84) 2.00 (±0.00) 3.34 (±0.12) 178.02 (±0.17) 47.28 (±1.21) 114.49 (±2.34) 28.81 (±0.42) 2.19 (±0.11) 44.99 (±1.69) 7.16 (±0.13) 30.61 (±0.64)

GAssist− ADI 6.49 (±0.45) 5.55 (±0.26) 5.85 (±0.26) 5.73 (±0.27) 5.08 (±0.32) 6.16 (±0.61) 4.39 (±0.39) 5.88 (±0.57) 4.46 (±0.40) 3.10 (±0.09) 3.00 (±0.00) 8.45 (±0.45) 3.00 (±0.05) 4.06 (±0.22) 5.11 (±0.64) 5.75 (±0.36) 9.52 (±0.74) 3.33 (±0.17) 3.29 (±0.16) 4.95 (±0.35) 6.12 (±0.30) 5.20 (±0.33)

GAssist− Intervalar 8.62 (±0.49) 5.61 (±0.49) 8.83 (±0.64) 6.72 (±0.44) 6.05 (±0.32) 6.16 (±0.61) 5.48 (±0.25) 8.66 (±0.42) 5.8 (±0.34) 3.20 (±0.17) 3.00 (±0.00) 10.07 (±0.38) 2.80 (±0.12) 4.15 (±0.16) 9.16 (±0.35) 6.69 (±0.46) 9.53 (±0.73) 3.63 (±0.25) 5.23 (±0.38) 5.19 (±0.41) 6.09 (±0.26) 6.21 (±0.37)

12.25 (±0.83) 4.08 (±0.95) 14.00 (±0.00) 4.17 (±1.00) 6.25 (±0.57) 4.83 (±0.69) 6.61 (±0.49) 4.77 (±1.05) 3.75 (±0.58) 3.32 (±0.26) 2.33 (±0.67) 4.79 (±0.49) 3.29 (±0.40) 3.18 (±0.19) 1.00 (±0.00) 4.27 (±0.62) 4.69 (±0.72) 3.05 (±0.10) 7.05 (±0.98) 7.03 (±0.53) 7.68 (±0.46) 5.35 (±0.55)

CORE

30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00) 30.00 (±0.00)

ILGA

14.81 (±0.34) 19.17 (±0.84) 7.67 (±0.29) 23.38 (±0.44) 16.40 (±0.44) 8.42 (±0.71) 15.96 (±0.24) 19.96 (±0.73) 22.82 (±0.74) 5.37 (±0.18) 3.36 (±0.28) 13.88 (±0.55) 2.00 (±0.00) 6.02 (±0.27) 6.02 (±0.15) 25.52 (±0.40) 33.13 (±0.96) 2.53 (±0.19) 5.69 (±0.52) 34.24 (±0.95) 6.00 (±0.00) 13.92 (±0.44)

BioHEL

C4.5 rules 19.31 (±0.61) 11.30 (±0.48) 7.12 (±0.15) 7.81 (±0.84) 9.55 (±0.52) 27.35 (±0.84) 9.19 (±0.49) 8.79 (±0.45) 4.00 (±0.00) 3.78 (±0.17) 15.35 (±0.34) 5.00 (±0.00) 5.48 (±0.27) 8.06 (±0.34) 20.15 (±0.46) 51.11 (±0.82) 4.12 (±0.16) 9.20 (±0.59) 33.81 (±1.01) 7.69 (±0.03) 13.41 (±0.43) 22.20 (±1.08) 8.63 (±0.95) 10.27 (±0.46) 13.21 (±1.46) 14.45 (±0.72) 31.24 (±3.31) 15.40 (±0.45) 49.44 (±0.93) 5.32 (±0.46) 3.19 (±0.12) 4.22 (±0.31) 14.36 (±1.11) 2.95 (±0.22) 5.66 (±0.23) 10.68 (±0.25) 43.31 (±2.25) 82.67 (±2.26) 6.48 (±0.46) 9.36 (±0.76) 18.08 (±0.90) 7.90 (±0.17) 18.05 (±0.90)

N SLV

24.90 (±1.08) 16.46 (±1.82) 12.88 (±0.86) 47.97 (±2.33) 34.71 (±1.77) 49.39 (±2.16) 28.75 (±1.04) 46.14 (±1.69) 14.77 (±2.29) 4.14 (±0.20) 4.52 (±0.50) 29.78 (±0.99) 3.48 (±0.38) 6.31 (±0.41) 16.46 (±1.11) 52.95 (±3.92) 115.33 (±2.49) 6.17 (±0.61) 9.40 (±1.57) 147.81 (±3.74) 7.43 (±0.23) 32.37 (±1.49)

BP M OGA

Table 10: Comparison of the number of CSRs in the classifiers generated by the evolutionary learners and one non-evolutionary learner. The average number of rules and s.d. are tabulated in the columns.


ato crx der d-p eco fla gls h-c hab irs lbr led mnk n-t son vle vwl wne wis yes zoo

id

M LP − BP 118 90 84 66 67 169 105 59 47 44 61 59 56 62 102 88 95 74 77 81 87

C4.5 rules 44 99 57 117 57 123 84 104 32 36 59 75 43 66 45 448 50 85 1609 53 53 46 47 45 27 71 102 37 30 27 27 36 36 47 52 80 70 42 56 75 43

IB3 128 99 91 97 72 101 171 71 58 46 55 64 58 68 236 154 2385 85 79 95 41

NB

EF S− RP S 1850 10755 8622 10115 1840 22375 1290 2123 1227 207 133 4627 3183 703 3556 20619 1589 610 2000 44144 334 31 42 199 61 36 81 26 69 29 42 35 44 45 39 50 49 75 43 59 79 43

CN N 1913 1029 850 1570 979 2775 414 2981 126 54 109 453 96 150 3639 13171 1532 272 1550 2923 132

N SLV 6609 3203 3076 1741 842 4876 2933 3561 197 224 2643 561 471 8252 22647 4492 5436 261 8717 764

HIDER 12664 6160 13445 6769 3494 5321 4384 7823 1492 941 3297 5185 595 1942 24014 23365 9432 7145 1121 8087 2525

U CS

GAssist− ADI 2580 3151 6227 3000 897 945 1155 2161 765 423 1164 2082 1698 982 8785 7002 2321 2255 450 3766 310

GAssist− Intervalar 1459 1950 5136 1376 759 1041 590 1336 676 219 373 1456 3717 474 4046 3635 1212 1255 300 2302 325 3726 4784 7750 3827 2951 7421 2009 1931 1117 910 913 1871 2669 1512 8880 7536 2134 3443 1050 8406 2450

CORE

Table 11: Comparison of the run times in seconds of BP M OGA and other learners. 7202 6749 18336 3634 2533 8134 2355 3015 473 245 1015 1687 935 907 4668 24640 5555 6316 1300 12138 2459

ILGA

6371 7208 1400 2319 870 7766 782 1144 986 210 603 1135 241 318 680 2693 4733 5553 653 3533 1213

BioHEL

6358 6043 3852 11490 5882 9650 10740 9570 160 276 220 4200 180 690 3000 24560 122160 1210 1190 171410 220

BP M OGA

4.1. Statistical test


Since we do not know the distribution of the data produced by the different classifiers, we perform non-parametric statistical tests. We have done Friedman's test (Friedman, 1937, 1940) with significance level α = 0.05, which is a non-parametric equivalent of the repeated-measures AN OV A (Fisher, 1955). For this we have used the ranks of the algorithms in Tables 6 and 7. Each table tabulates the performance of 7 classifiers on 21 data sets. So, here k (number of classifiers) = 7 and n (number of data sets) = 21. Then we calculate the value of M by using equation (10):

M = (12 / (nk(k + 1))) Σ_j R_j^2 − 3n(k + 1)    (10)

where R_j is the sum of the ranks in column j of Tables 6 and 7, i.e. the summation of the ranks of each classifier.
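A direct translation of equation (10) into code is given below (a minimal sketch; the rank matrix is assumed to hold one row per data set and one column per classifier):

    /** Computes the Friedman statistic M of equation (10) from a rank table:
     *  ranks[i][j] is the rank of classifier j on data set i. */
    public class FriedmanTest {
        public static double friedmanM(double[][] ranks) {
            int n = ranks.length;          // number of data sets
            int k = ranks[0].length;       // number of classifiers
            double sumSquaredColumnSums = 0.0;
            for (int j = 0; j < k; j++) {
                double rj = 0.0;
                for (int i = 0; i < n; i++) rj += ranks[i][j];   // R_j: column sum of ranks
                sumSquaredColumnSums += rj * rj;
            }
            return 12.0 / (n * k * (k + 1)) * sumSquaredColumnSums - 3.0 * n * (k + 1);
        }
    }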

The M values calculated from Tables 6 and 7 are 285.36 and 273.82 respectively, which are much greater than the critical value of 12.59 at significance level α = 0.05. Thus the test rejects the null hypothesis, which states that on average all the classifiers perform equivalently. Statistical analysis has also been performed by comparing the performance of each pair of learners using the non-parametric Wilcoxon signed-rank test (Wilcoxon, 1945) (2-tailed hypothesis with significance level α = 0.05) with an online calculator (url:http://www.socscistatistics.com/tests/signedranks/Default2.aspx). We use the classification accuracies of the different classifiers from Tables 6 and 7 for this purpose. Table 12 shows the approximate p-values for the pairwise comparison of the evolutionary classifiers on the test data sets according to the Wilcoxon signed-rank test. The symbols S+/S- indicate that the performance of the method in the row is significantly superior/inferior compared to the performance of the method shown in the column. Similarly, the symbols NS+/NS- denote a non-significant improvement/degradation. Table 13 shows the approximate p-values for the pairwise comparison of the non-evolutionary classifiers. The last rows of Tables 12 and 13 show that, according to the Wilcoxon signed-rank test, the BP M OGA based classifier is better than all the other classifiers (including U CS, EF S − RP S and N SLV ) in terms of classification accuracy on the test data sets.

5. Conclusion


The proposed BP M OGA employs a local search, unlike other GA based approaches where the local search is applied separately. Phase-I builds the CSRs, phase-II generates CRs from the CSRs of phase-I, and the CRs of phase-II help to extract and fine-tune the CSRs in the next run of phase-I, which is possible because of the cyclic nature of BP M OGA. It performs Bloat control, i.e. it limits the number of CSRs in CRs. Features of the data sets take part in CSR formation probabilistically. We accommodate both the Michigan and the Pittsburgh approaches to develop BP M OGA, considering the merits of both the


id HIDER U CS GAssist − ADI GAssist − Intervalar CORE ILGA BioHEL BP M OGA S+ NS+ NS+ SSS+ S+

HIDER NSNSSSNSNS+

U CS 0.01684 NSSSNS+ NS+

GAssist − ADI 0.41222 0.24604 SSNS+ S+

GAssist − Intervalar 0.5287 0.07346 0.1443 NSS+ S+

CORE 0.00452 0.00054 0.00024 0.00042 S+ S+

ILGA 0.00714 0.0001 0 0.0001 0.76418 NS+

BioHEL 0.0218 0.6389 0.4979 0.2172 0.0037 0.0008

BP M OGA 0.0041 0.6358 0.108 0.0221 0.0006 0.00005 0.3953

Table 12: Pairwise comparison of the performance of the evolutionary classifiers by means of a Wilcoxon signed-rank test on the test data sets.


id M LP − BP C4.5 rules IB3 NB EF S − RP S CN N N SLV BP M OGA S+ S+ S+ S+ S+ S+ S+

M LP − BP SNSNS+ NSNS+ NS+

C4.5 rules 0.0003 NS+ S+ S+ S+ S+

IB3 0.0271 0.00906 NS+ NSNS+ S+

NB 0.00104 0.1902 0.23014 NSNSNS+

EF S − RP S 0.00038 0.35238 0.0 0.28914 NS+ NS+

CN N 0.00022 0.71138 0.0041 0.8493 0.30302 NS+

N SLV 0.0001 0.72786 0.00374 0.2187 0.35758 0.87288

BP M OGA 0.00007 0.1396 0.00215 0.013 0.76418 0.26269 0.45326

Table 13: Pairwise comparison of the performance of the non-evolutionary classifiers by means of a Wilcoxon signed-rank test on the test data sets.


methods and minimizing the drawbacks. The BP M OGA based classifier can extract CSRs for different classes simultaneously and does not need any user specified threshold value. BP M OGA deals with different types of features (continuous and categorical) and handles missing feature values efficiently. In the present scope, at the time of rule creation and selection we give the utmost importance to confidence in order to develop an accurate classifier. We carried out a comprehensive experiment and analysis while making a performance comparison with seven evolutionary crisp rule learners and seven non-evolutionary learners. The analysis of the algorithms depicts that the BP M OGA based classifier is highly competitive with both groups of learners. The IP of BP M OGA does not depend upon the number of features of the data sets; it is proportional to the number of data points of the training data sets. The intermediate population does not depend upon the number of features of the data sets either. BP M OGA has several mechanisms to control the size of the population, and it also selects better features while eliminating worse ones, so its population size is comparable with other M OGA based approaches. The high cost of objective function evaluations is a limitation of any GA based approach, including BP M OGA; hence these are not well suited to very large data sets. Nowadays this is less of a problem, as several mechanisms are available to handle big data. As the other algorithms were run through KEEL while BP M OGA was run separately, a conclusive run time comparison cannot be drawn here. However, a comparative time complexity analysis of BP M OGA in terms of function points with other algorithms can be a good prospective future work. In future we plan to carry out simulation studies analyzing the effects of varying levels of noise, and simulations with different feature types or missing values. Here we have used the GA on static data sets; if the GA can be made scalable and adaptable, it will become an efficient and effective tool for mining large data sets. In (Vivekanandan and Nedunchezhian, 2011), researchers have used a GA for classification on data streams; in future, we can modify BP M OGA similarly. The results are summarized in the SW OT (Strengths, Weaknesses, Opportunities, and Threats) analysis presented in Table 14, where strengths represent the main advantages of the BP M OGA based classifier, weaknesses reveal its drawbacks, opportunities outline some suggested further lines of investigation, and threats include some optional approaches considered by other methods (discussed in section 1) that could compete with our proposal.

Compliance with Ethical Standards


Funding: This study was not funded by any profit or nonprofit organization.

Conflict of interest: All authors declare that they have no conflicts of interest.

Table 14: SW OT analysis of the BP M OGA based classifier.

Strengths:
- It shows a high performance in terms of classification accuracy, which is comparable with the state-of-the-art classifiers
- It extracts optimized CSRs and CRs by a global optimizer (GA)
- It does Bloat control, i.e. controlling the number of CSRs in a CR
- It can deal with missing feature values
- It can deal with continuous and categorical features

Weaknesses:
- It cannot do incremental learning, so it is not capable of extracting rules from large data sets
- The classification decision boundaries are always axis-parallel
- The computational complexity of BP M OGA may be higher than that of some classification algorithms

Opportunities:
- It may be applied to data streams
- It can be modified to create fuzzy rules to make the representative rule more readable
- It can be modified to create more general and simpler classifiers
- In future BP M OGA can be used to solve other optimization problems

Threats:
- Sometimes single-phased GAs are capable of extracting generalized CRs with high accuracies

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.

References

Aguilar-Ruiz, J. S., Giráldez, R., and Riquelme, J. C. (2007). Natural encoding for evolutionary supervised learning. IEEE Transactions on Evolutionary Computation, 11(4):466–479.

Aguilar-Ruiz, J. S., Riquelme, J. C., and Toro, M. (2003). Evolutionary learning of hierarchical decision rules. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 33(2):324–331.

Aha, D. W., Kibler, D., and Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1):37–66.

Alcalá-Fdez, J., Sánchez, L., Garcia, S., del Jesus, M. J., Ventura, S., Garrell, J. M., Otero, J., Romero, C., Bacardit, J., Rivas, V. M., et al. (2008). KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Computing, 13(3):307–318.

Asuncion, A. and Newman, D. (2007). UCI machine learning repository.

Bacardit, J., Burke, E. K., and Krasnogor, N. (2009). Improving the scalability of rule-based evolutionary learning. Memetic Computing, 1(1):55–67.

Bacardit, J. and Garrell, J. M. (2003). Evolving multiple discretizations with adaptive intervals for a Pittsburgh rule-based learning classifier system. In Genetic and Evolutionary Computation Conference, pages 1818–1831. Springer.


Bacardit, J. and Garrell, J. M. (2004). Analysis and improvements of the adaptive discretization intervals knowledge representation. In Genetic and Evolutionary Computation Conference, pages 726–738. Springer.

Bacardit, J. and Garrell, J. M. (2007). Bloat control and generalization pressure using the minimum description length principle for a Pittsburgh approach learning classifier system. In Learning Classifier Systems, pages 59–79. Springer.

Bacardit, J., Stout, M., Hirst, J. D., Sastry, K., Llorà, X., and Krasnogor, N. (2007). Automated alphabet reduction method with evolutionary algorithms for protein structure prediction. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pages 346–353. ACM.

Bandyopadhyay, S., Murthy, C., and Pal, S. K. (2000). VGA-classifier: design and applications. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 30(6):890–895.

1250

1255

Bandyopadhyay, S., Pal, S. K., and Aruna, B. (2004). Multiobjective GAs, quantitative indices, and pattern classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 34(5):2088–2099. Bayardo Jr, R. J. and Agrawal, R. (1999). Mining the most interesting rules. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 145–154. ACM. Bernad´ o, E. (2002). Contributions to genetic based classifier systems. PhD thesis, Enginyeria i Arquitectura la Salle, Ramon Llull University, Barcelona.

1260

Bernad´ o-Mansilla, E. and Garrell-Guiu, J. M. (2003). Accuracy-based learning classifier systems: models, analysis and applications to classification tasks. Evolutionary computation, 11(3):209–238. Booker, L. B. (1982). Intelligent behavior as an adaptation to the task environment.

1265

Castellanos-Garz´ on, J. A., Costa, E., Corchado, J. M., et al. (2019). An evolutionary framework for machine learning applied to medical data. KnowledgeBased Systems, page 104982. Castillo, L., Gonz´ alez, A., and P´erez, R. (2001). Including a simplicity criterion in the selection of the best rule in a genetic fuzzy learning algorithm. Fuzzy Sets and Systems, 120(2):309–321.

1270

Chan, Y.-H., Chiang, T.-C., and Fu, L.-C. (2010). A two-phase evolutionary algorithm for multiobjective mining of classification rules. In Proceedings of the 2010 Congress on Evolutionary Computation. CEC’10, pages 1–7. IEEE. Chiu, C. and Hsu, P.-L. (2005). A constraint-based genetic algorithm approach for mining classification rules. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(2):205–220. 44

1275

1280

Corcoran, A. L. and Sen, S. (1994). Using real-valued genetic algorithms to evolve rule sets for classification. In Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE World Congress on Computational Intelligence, pages 120–124. IEEE. De Jong, K. (1988). Learning with genetic algorithms: An overview. Machine learning, 3(2-3):121–138. De Jong, K. A., Spears, W. M., and Gordon, D. F. (1993). Using genetic algorithms for concept learning. Machine learning, 13(2-3):161–188.

1285

1290

de la Iglesia, B., Reynolds, A., and Rayward-Smith, V. J. (2005). Developments on a multi-objective metaheuristic (MOMH) algorithm for finding interesting sets of classification rules. In International Conference on Evolutionary MultiCriterion Optimization, pages 826–840. Springer. de la Iglesia, B., Richards, G., Philpott, M. S., and Rayward-Smith, V. J. (2006). The application and effectiveness of a multi-objective metaheuristic algorithm for partial classification. European Journal of Operational Research, 169(3):898–917. Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, 6(2):182–197.

1295

Dehuri, S. and Mall, R. (2006). Predictive and comprehensible rule discovery using a multi-objective genetic algorithm. Knowledge-Based Systems, 19(6):413–421. Dehuri, S., Patnaik, S., Ghosh, A., and Mall, R. (2008). Application of elitist multi-objective genetic algorithm for classification rule generation. Applied Soft Computing, 8(1):477–487.

1300

Derrac, J., Verbiest, N., Garc´ıa, S., Cornelis, C., and Herrera, F. (2013). On the use of evolutionary feature selection for improving fuzzy rough set based prototype selection. Soft Computing, 17(2):223–238. Domingos, P. and Pazzani, M. (1997). On the optimality of the simple bayesian classifier under zero-one loss. Machine learning, 29(2-3):103–130.

1305

1310

Dutta, D. and Dutta, P. (2011). A real coded MOGA for mining classification rules with missing attribute values. In Proceedings of the 2011 International Conference on Communication, Computing & Security, ICCCS ’11, pages 355–358. ACM. Fern´ andez, A., Garc´ıa, S., Luengo, J., Bernad´ o-Mansilla, E., and Herrera, F. (2010). Genetics-based machine learning for rule induction: state of the art, taxonomy, and comparative study. IEEE Transactions on Evolutionary Computation, 14(6):913–941.

45

1315

Fern´ andez-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181. Fidelis, M. V., Lopes, H. S., and Freitas, A. A. (2000). Discovering comprehensible classification rules with a genetic algorithm. In Proceedings of the 2000 Congress on Evolutionary Computation. CEC’00, volume 1, pages 805–810. IEEE.

1320

Fisher, R. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society. Series B (Methodological), pages 69–78. Fourman, M. P. (1985). Compaction of symbolic layout using genetic algorithms. In Genetic Algorithms and Their Applications: Proc. 1st Int. Conf. Genetic Algorithms, Princeton, Lawrence Erlbaum, NJ, 1985.

1325

Franco, M. A., Krasnogor, N., and Bacardit, J. (2013). gassist vs. biohel: critical assessment of two paradigms of genetics-based machine learning. Soft Computing, 17(6):953–981. Freitas, A. A. (1999). A genetic algorithm for generalized rule induction. In Advances in Soft Computing, pages 340–353. Springer.

1330

1335

Freitas, A. A. (2003). A survey of evolutionary algorithms for data mining and knowledge discovery. In Advances in evolutionary computing, pages 819–845. Springer. Freitas, A. A. (2004). A critical review of multi-objective optimization in data mining: a position paper. ACM SIGKDD Explorations Newsletter, 6(2):77– 86. Freitas, A. A. (2013). Data mining and knowledge discovery with evolutionary algorithms. Springer Science & Business Media.

1340

Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the american statistical association, 32(200):675–701. Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11(1):86–92. Gao, Q.-B. and Wang, Z.-Z. (2007). Center-based nearest neighbor classifier. Pattern Recognition, 40(1):346–349.

1345

1350

Garc´ıa, D., Gonz´ alez, A., and P´erez, R. (2014). Overview of the slave learning algorithm: A review of its evolution and prospects. International Journal of Computational Intelligence Systems, 7(6):1194–1221. Gorzalczany, M. B. and Rudzi´ nski, F. (2016). A multi-objective genetic optimization for fast, fuzzy rule-based credit classification with balanced accuracy and interpretability. Applied Soft Computing, 40:206–220. 46

Greene, D. P. and Smith, S. F. (1993). Competition-based induction of decision models from examples. Machine Learning, 13(2-3):229–257.

1355

Guan, S.-U. and Zhu, F. (2005). An incremental approach to genetic-algorithmsbased classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 35(2):227–239. G¨ undo˘ gan, K. K., Alata¸s, B., and Karci, A. (2004). Mining classification rules by using genetic algorithms with non-random initial population and uniform operator. Turkish Journal of Electrical Engineering & Computer Sciences, 12(1):43–52.

1360

Hanley, J. P., Rizzo, D. M., Buzas, J. S., and Eppstein, M. J. (2019). A tandem evolutionary algorithm for identifying causal rules from complex data. Evolutionary computation, pages 1–32. Holland, J. H. (1976). Progress in theoretical biology in r. rosen and f. m. snell editors.

1365

Holland, J. H. (1992). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press. Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. (1989). Induction: Processes of inference, learning, and discovery. MIT press.

1370

1375

Holland, J. H. and Reitman, J. S. (1978). Cognitive systems based on adaptive algorithms. In Pattern-directed inference systems, pages 313–329. Elsevier. Ishibuchi, H., Kuwajima, I., and Nojima, Y. (2008). Evolutionary multiobjective rule selection for classification rule mining. In Multi-Objective Evolutionary Algorithms for Knowledge Discovery from Databases, pages 47–70. Springer. Ishibuchi, H. and Nojima, Y. (2007). Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. International Journal of Approximate Reasoning, 44(1):4–31.

1380

Ishibuchi, H., Nozaki, K., Yamamoto, N., and Tanaka, H. (1994). Construction of fuzzy classification systems with rectangular fuzzy rules using genetic algorithms. Fuzzy sets and systems, 65(2-3):237–253. Ishibuchi, H., Nozaki, K., Yamamoto, N., and Tanaka, H. (1995). Selecting fuzzy if-then rules for classification problems using genetic algorithms. IEEE Transactions on fuzzy systems, 3(3):260–270.

1385

Ishibuchi, H. and Yamamoto, T. (2004). Fuzzy rule selection by multi-objective genetic local search algorithms and rule evaluation measures in data mining. Fuzzy sets and systems, 141(1):59–88.

47

1390

Ishibuchi, H., Yamamoto, T., and Nakashima, T. (2005). Hybridization of fuzzy GBML approaches for pattern classification problems. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(2):359–365. Janikow, C. Z. (1993). A knowledge-intensive genetic algorithm for supervised learning. In Genetic Algorithms for Machine Learning, pages 33–72. Springer.

1395

Jutler, H. (1967). Liniejnaja modiel z nieskolkimi celevymi funkcjami (linear model with several objective functions). Ekonomika i matematiceckije Metody, 3:397–406. Kaya, M. (2010). Autonomous classifiers with understandable rule using multiobjective genetic algorithms. Expert Systems with Applications, 37(4):3489– 3494.

1400

Langdon, W. B. (1997). Fitness causes bloat in variable size representations. Technical Report CSRP-97-14, Michigan State University, East Lansing, Michigan. Lei, Y., Zuo, M. J., He, Z., and Zi, Y. (2010). A multidimensional hybrid intelligent method for gear fault diagnosis. Expert Systems with Applications, 37(2):1419–1430.

1405

Mansoori, E. G., Zolghadri, M. J., and Katebi, S. D. (2008). SGERD: A steadystate genetic algorithm for extracting fuzzy classification rules from data. IEEE Transactions on Fuzzy Systems, 16(4):1061–1071. Michalewicz, Z. and Hartley, S. J. (1996). Genetic algorithms+ data structures= evolution programs. Mathematical Intelligencer, 18(3):71.

1410

1415

Minnaert, B., Martens, D., De Backer, M., and Baesens, B. (2015). To tune or not to tune: rule evaluation for metaheuristic-based sequential covering algorithms. Data mining and knowledge discovery, 29(1):237–272. Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S., and Coello, C. A. C. (2014a). A survey of multiobjective evolutionary algorithms for data mining: Part I. IEEE Transactions Evolutionary Computation, 18(1):4–19. Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S., and Coello, C. A. C. (2014b). Survey of multiobjective evolutionary algorithms for data mining: Part II. IEEE Transactions on Evolutionary Computation, 18(1):20–35.

1420

1425

Narukawa, K., Nojima, Y., and Ishibuchi, H. (2005). Modification of evolutionary multiobjective optimization algorithms for multiobjective design of fuzzy rule-based classification systems. In Fuzzy Systems, 2005. FUZZ’05. The 14th IEEE International Conference on, pages 809–814. IEEE. Nazmi, S., Razeghi-Jahromi, M., and Homaifar, A. (2017). Multilabel classification with weighted labels using learning classifier systems. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 275–280. IEEE. 48

Nguyen, T., Khosravi, A., Creighton, D., and Nahavandi, S. (2015). Classification of healthcare data using genetic fuzzy logic system and wavelets. Expert Systems with Applications, 42(4):2184–2197. 1430

1435

Noda, E., Freitas, A. A., and Lopes, H. S. (1999). Discovering interesting prediction rules with a genetic algorithm. In Proceedings of the 1999 Congress on Evolutionary Computation. CEC’99, volume 2, pages 1322–1329. IEEE. Orriols-Puig, A., Casillas, J., and Bernad´ o-Mansilla, E. (2009). Fuzzy-ucs: a michigan-style learning fuzzy-classifier system for supervised learning. IEEE transactions on evolutionary computation, 13(2):260–283. Pal, S. K., Bandyopadhyay, S., and Murthy, C. (1998). Genetic algorithms for generation of class boundaries. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(6):816–828.

1440

Pareto, V. (1972). Manual of political economy [manuale di economia politica].(as schwier, transactions).(edited by as schwier and an page). Quinlan, J. R. (2014). C4.5: programs for machine learning. Elsevier. Rojas, R. (2013). Neural networks: a systematic introduction. Springer Science & Business Media.

1445

Rudzi´ nski, F. (2016). A multi-objective genetic optimization of interpretabilityoriented fuzzy rule-based classifiers. Applied Soft Computing, 38:118–133. Ryerkerk, M., Averill, R., Deb, K., and Goodman, E. (2019). A survey of evolutionary algorithms using metameric representations. Genetic Programming and Evolvable Machines, pages 1–38.

1450

1455

Sanz, J. A., Bernardo, D., Herrera, F., Bustince, H., and Hagras, H. (2015). A compact evolutionary interval-valued fuzzy rule-based classification system for the modeling and prediction of real-world financial applications with imbalanced data. IEEE Transactions on Fuzzy Systems, 23(4):973–990. Sanz, J. A., Fernandez, A., Bustince, H., and Herrera, F. (2013). IVTURS: A linguistic fuzzy rule-based classification system based on a new interval-valued fuzzy reasoning method with tuning and rule selection. IEEE Transactions on Fuzzy Systems, 21(3):399–411. Sanz, J. A., Galar, M., Jurio, A., Brugos, A., Pagola, M., and Bustince, H. (2014). Medical diagnosis of cardiovascular diseases using an interval-valued fuzzy rule-based classification system. Applied Soft Computing, 20:103–111.

1460

Schaffer, J. D. (1984). Some experiments in machine learning using vector evaluated genetic algorithms (artificial intelligence, optimization, adaptation, pattern recognition).

49

1465

1470

Schaffer, J. D. (1985). Multiple objective optimization with vector evaluated genetic algorithms. In Proceedings of the First International Conference on Genetic Algorithms and Their Applications, 1985. Lawrence Erlbaum Associates. Inc., Publishers. Serdio, F., Lughofer, E., Zavoianu, A.-C., Pichler, K., Pichler, M., Buchegger, T., and Efendic, H. (2017). Improved fault detection employing hybrid memetic fuzzy modeling and adaptive filters. Applied Soft Computing, 51:60– 82. Setnes, M. and Roubos, H. (2000). GA-fuzzy modeling and classification: complexity and performance. IEEE transactions on Fuzzy Systems, 8(5):509–522. Sigaud, O. and Wilson, S. W. (2007). Learning classifier systems: a survey. Soft Computing, 11(11):1065–1078.

1475

Smith, S. F. (1980). A learning system based on genetic adaptive algorithms. Tan, K. C., Yu, Q., and Ang, J. H. (2006). A coevolutionary algorithm for rules discovery in data mining. International Journal of Systems Science, 37(12):835–864.

1480

1485

Urbanowicz, R. J., Granizo-Mackenzie, A., and Moore, J. H. (2012). An analysis pipeline with statistical and visualization-guided knowledge discovery for michigan-style learning classifier systems. IEEE computational intelligence magazine, 7(4):35–45. Urbanowicz, R. J. and Moore, J. H. (2009). Learning classifier systems: a complete introduction, review, and roadmap. Journal of Artificial Evolution and Applications, 2009:1. Venturini, G. (1993). SIA: a supervised inductive algorithm with genetic search for learning attributes based concepts. In European conference on machine learning, pages 280–296. Springer.

1490

Vivekanandan, P. and Nedunchezhian, R. (2011). Mining data streams with concept drifts using genetic algorithm. Artificial Intelligence Review, 36(3):163– 178. Weijters, T. and Paredis, J. (2002). Genetic rule induction at an intermediate level. Knowledge-Based Systems, 15(1-2):85–94.

1495

Weiss, S. M. and Kulikowski, C. A. (1991). Computer systems that learn: classification and prediction methods from statistics, neural nets, machine learning, and expert systems. Morgan Kaufmann Publishers Inc. Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80–83.

1500

Wilson, S. W. (1986). Knowledge growth in an artificial animal. In Adaptive and Learning Systems, pages 255–264. Springer. 50

Wilson, S. W. (1994). ZCS: A zeroth level classifier system. Evolutionary computation, 2(1):1–18. Wilson, S. W. (1995). Classifier fitness based on accuracy. Evolutionary computation, 3(2):149–175. 1505

1510

Xia, H., Zhuang, J., and Yu, D. (2013). Novel soft subspace clustering with multi-objective evolutionary approach for high-dimensional data. Pattern Recognition, 46(9):2562–2575. Zhong-Yang, X., Lei, Z., and Yu-Fang, Z. (2004). A classification rule mining method using hybrid genetic algorithms. In TENCON 2004. 2004 IEEE Region 10 Conference, pages 207–210. IEEE.

51

[Flowchart nodes of Figure 1: Start; Data set; Preprocessing; Generation of phase-I = 1; Population initialization; Generate intermediate population from the data set (if the generation of phase-I is odd) or from CSRs (otherwise); Crossover; Mutation; Combination; Chromosome filtering; Computation of objective values; Selection of Pareto chromosomes; increment generation of phase-I; repeat while generation of phase-I < 50; output CSRs; End.]

Figure 1: Flowchart of phase-I of BPMOGA
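For readers who prefer code to flowcharts, the listing below is a minimal Python sketch of the generational loop of phase-I as summarized in Figure 1. It is only a structural skeleton under our reading of the flowchart: every operator is a caller-supplied placeholder rather than the actual BPMOGA implementation, and the odd/even seeding of the intermediate population is inferred from the decision node of the figure.

from typing import Callable, List, Sequence

def phase_one(dataset: Sequence,
              init_population: Callable[[Sequence], List],
              intermediate_from_dataset: Callable[[Sequence], List],
              intermediate_from_csrs: Callable[[List], List],
              crossover: Callable[[List], List],
              mutate: Callable[[List], List],
              filter_chromosomes: Callable[[List], List],
              evaluate: Callable[[List, Sequence], None],
              pareto_select: Callable[[List], List],
              max_generations: int = 50) -> List:
    """Structural skeleton of phase-I following Figure 1; all genetic
    operators are placeholders supplied by the caller, not the paper's code."""
    population = init_population(dataset)
    for generation in range(1, max_generations + 1):
        # Odd generations seed the intermediate population from the data set,
        # even generations from the current CSRs (decision node in Figure 1).
        if generation % 2 == 1:
            intermediate = intermediate_from_dataset(dataset)
        else:
            intermediate = intermediate_from_csrs(population)
        offspring = mutate(crossover(intermediate))
        combined = population + offspring        # combination step
        combined = filter_chromosomes(combined)  # chromosome filtering
        evaluate(combined, dataset)              # objective value computation
        population = pareto_select(combined)     # keep the Pareto chromosomes
    return population                            # CSRs handed over to phase-II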

Figure 2: Hybrid Coding

Figure 3: Chromosome with all features

Figure 4: Chromosome with continuous feature

Figure 5: Chromosome with categorical feature
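To make the hybrid coding of Figures 2-5 concrete, the following is a minimal Python sketch of one plausible chromosome representation: a (lower, upper) value pair per continuous feature and a bit per category for categorical features, with None standing for the do-not-care symbol (#). The class, feature and gene names, and the treatment of missing record values, are illustrative assumptions, not the authors' implementation.

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class CSRChromosome:
    continuous_genes: Dict[str, Tuple[float, float]]   # feature -> (low, high)
    categorical_genes: Dict[str, Optional[List[int]]]  # feature -> bits, None = '#'
    predicted_class: str

    def covers(self, record: Dict[str, object]) -> bool:
        """True if the record satisfies the IF-part of this CSR."""
        for feature, (low, high) in self.continuous_genes.items():
            value = record.get(feature)
            if value is None:                 # missing value: do not reject the record
                continue
            if not (low <= value <= high):
                return False
        for feature, bits in self.categorical_genes.items():
            if bits is None:                  # don't-care gene
                continue
            value = record.get(feature)
            if value is None:
                continue
            if bits[value] != 1:              # category index not allowed by the rule
                return False
        return True

# Example mirroring the figures: one continuous gene (98.95-100.0) and one
# three-valued categorical gene with the bit pattern 1 1 0.
rule = CSRChromosome({"temperature": (98.95, 100.0)}, {"colour": [1, 1, 0]}, "class_1")
print(rule.covers({"temperature": 99.2, "colour": 1}))   # True
print(rule.covers({"temperature": 99.2, "colour": 2}))   # False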

[Worked example of Figure 6: two parent chromosomes carrying the continuous-feature gene values 98.95/100.0 and 98.95/99.65 plus three categorical bits exchange genetic material at the marked crossover point, producing two child chromosomes.]

Figure 6: Crossover at gene of continuous feature

[Worked example of Figure 7: the same two parents exchange genetic material at a crossover point placed inside the categorical bits, producing two child chromosomes.]

Figure 7: Crossover at gene of categorical feature

[Worked example of Figure 8: one parent carries a do-not-care (#) categorical gene; crossover at that gene propagates the # symbol into one of the child chromosomes.]

Figure 8: Crossover at gene of categorical feature with do not care condition
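The crossover in Figures 6-8 is a single-point exchange over the gene string. The sketch below shows a generic single-point crossover in Python; the gene lists are only loosely based on the values appearing in the figures, and the function name and layout are our own, not the paper's code.

import random
from typing import List, Tuple

def single_point_crossover(parent1: List, parent2: List,
                           point: int = None) -> Tuple[List, List]:
    """Exchange the gene segments of two equal-length chromosomes after `point`."""
    assert len(parent1) == len(parent2)
    if point is None:
        point = random.randint(1, len(parent1) - 1)   # never cut at the very ends
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# Hypothetical gene lists: the first two positions hold continuous-feature
# values, the remaining positions are categorical bits ('#' = don't care).
p1 = [98.95, 100.0, 1, 1, 0]
p2 = [98.95, 99.65, 1, 1, "#"]
c1, c2 = single_point_crossover(p1, p2, point=2)
print(c1)   # [98.95, 100.0, 1, 1, '#']
print(c2)   # [98.95, 99.65, 1, 1, 0]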

Figure 9: Mutation of gene of continuous feature

[Worked examples of Figures 9 and 10: mutation at a single gene changes a continuous-feature value from 100.0 to 99.9 and flips a categorical-feature bit from 1 to 0.]

Figure 10: Mutation of gene of categorical feature
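A minimal Python sketch of the two mutation cases of Figures 9 and 10 follows: a continuous-feature gene is resampled within its feature bounds, while a categorical bit is flipped. The function, the default bounds and the example values are illustrative assumptions taken loosely from the figures, not the authors' implementation.

import random
from typing import List

def mutate(chromosome: List, index: int, bounds=(98.95, 100.0)) -> List:
    """Mutate a single gene: resample a continuous value within its bounds,
    or flip a categorical bit (hypothetical helper, not the paper's code)."""
    child = list(chromosome)
    gene = child[index]
    if isinstance(gene, float):
        low, high = bounds
        child[index] = round(random.uniform(low, high), 2)   # e.g. 100.0 -> 99.9
    elif gene in (0, 1):
        child[index] = 1 - gene                               # e.g. 1 -> 0
    return child

parent = [98.95, 100.0, 1, 1, 0]
print(mutate(parent, index=1))   # continuous gene resampled inside its bounds
print(mutate(parent, index=2))   # categorical bit flipped: [98.95, 100.0, 0, 1, 0]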


Figure 11: Modified chromosome with continuous feature


[Flowchart nodes of Figure 12: Start; CSRs; Sorting of CSRs; Generation of phase-II = 1; Sorted CSRs; Initialization of population; Generation of intermediate population by initialization of population (if the generation of phase-II is odd) or from CRs (otherwise); Crossover; Mutation; Combination; Bloat control; Chromosome filtering; Evaluation of objective values; Selection of Pareto chromosomes; increment generation of phase-II; repeat while generation of phase-II < 50; output CRs; End.]

Figure 12: Flowchart of phase-II

[Flowchart nodes of Figure 13: Start; 1st run of phase-I; sorting of CSRs; 1st run of phase-II; extraction of CSRs from CRs; 2nd run of phase-I; sorting of CSRs; 2nd run of phase-II; combination; eliminating duplicate CRs; selection of Pareto CRs; extraction of CSRs from CRs; selection of Pareto CSRs; 3rd run of phase-I; and so on.]

Figure 13: Combinations of phase-I and II of BPMOGA

[Figure 14 shows accuracy-versus-coverage panels (both axes from 0 to 1): the CSRs extracted by phase-I for Class 1 (CSR11-CSR15), Class 2 (CSR21-CSR25) and Class 3 (CSR31-CSR35) from the data set, the Complete Rules CR1-CR5 built by phase-II, and the CSRs selected from those CRs.]

Figure 14: The process of rule selection by BPMOGA
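The selection step pictured in Figure 14 keeps, for each class, only the rules that are non-dominated in the coverage-accuracy plane. The sketch below shows a straightforward Pareto-front filter in Python; the coordinates are made up for illustration and only the CSR labels are borrowed from the figure, so this is not the authors' selection procedure.

from typing import List, Tuple

def pareto_front(rules: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Keep the non-dominated rules when both coverage and accuracy are maximised.
    Each rule is (name, coverage, accuracy); a rule is dominated if another rule
    is at least as good on both objectives and strictly better on at least one."""
    front = []
    for name, cov, acc in rules:
        dominated = any(
            (c >= cov and a >= acc) and (c > cov or a > acc)
            for n, c, a in rules if n != name
        )
        if not dominated:
            front.append((name, cov, acc))
    return front

# Illustrative CSRs for one class; the (coverage, accuracy) values are invented.
csrs = [("CSR11", 0.20, 0.95), ("CSR12", 0.40, 0.90),
        ("CSR13", 0.35, 0.80), ("CSR14", 0.70, 0.60), ("CSR15", 0.90, 0.40)]
print(pareto_front(csrs))   # CSR13 is dominated by CSR12 and is dropped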

AUTHORSHIP STATEMENT

Manuscript title: A bi-phased multi-objective genetic algorithm based classifier

All persons who meet authorship criteria are listed as authors, and all authors certify that they have participated sufficiently in the work to take public responsibility for the content, including participation in the concept, design, analysis, writing, or revision of the manuscript. Furthermore, each author certifies that this material or similar material has not been and will not be submitted to or published in any other publication before its appearance in this journal.

Authorship contributions

Category 1
Conception and design of study: D. Dutta, J. Sil, P. Dutta. Acquisition of data: D. Dutta. Analysis and/or interpretation of data: D. Dutta.

Category 2
Drafting the manuscript: D. Dutta, J. Sil, P. Dutta. Revising the manuscript critically for important intellectual content: D. Dutta, J. Sil, P. Dutta.

Category 3
Approval of the version of the manuscript to be published: D. Dutta, J. Sil, P. Dutta.