A fast and efficient multi-objective evolutionary learning scheme for fuzzy rule-based classifiers
Michela Antonelli *, Pietro Ducange, Francesco Marcelloni

Dipartimento di Ingegneria dell'Informazione, University of Pisa, 56122 Pisa, Italy

Article history: Received 20 January 2013; Received in revised form 28 May 2014; Accepted 9 June 2014; Available online xxxx

Keywords: Multi-objective evolutionary fuzzy system; Fuzzy rule-based classifier; Evolutionary rule selection; Evolutionary condition selection

* Corresponding author. E-mail address: [email protected] (M. Antonelli).
Abstract

In recent years, multi-objective evolutionary algorithms (MOEAs) have been extensively used to generate fuzzy rule-based systems characterized by different trade-offs between accuracy and complexity. In this paper, we propose an MOEA-based approach to concurrently learn the rule and data bases of fuzzy rule-based classifiers (FRBCs). In particular, the rule bases are generated by exploiting a rule and condition selection (RCS) strategy, which, during the evolutionary process, selects a reduced number of rules from a heuristically generated set of candidate rules and a reduced number of conditions for each selected rule. RCS can be considered as rule learning in a constrained search space. As regards the data base learning, the membership function parameters of each linguistic term used in the rules are learned concurrently with the application of RCS. We tested our approach on twenty-four classification benchmarks and compared our results with those obtained by two similar state-of-the-art MOEA-based approaches and by two well-known non-evolutionary classification algorithms, namely FURIA and C4.5. Using non-parametric statistical tests, we show that our approach generates FRBCs with accuracy and complexity statistically comparable to, and sometimes better than, those generated by the two MOEA-based approaches, while using only 5% of the fitness evaluations employed by these approaches. Further, the classifiers generated by our approach prove to be more interpretable than those generated by the FURIA and C4.5 algorithms, while achieving the same accuracy level.

© 2014 Published by Elsevier Inc.
1. Introduction
Fuzzy rule-based systems (FRBSs) have proved to be a very powerful and effective tool in classification problems [35,39]. Fuzzy Rule-Based Classifiers (FRBCs) have been widely exploited in several engineering applications, such as intrusion detection [47], thermography-based breast cancer diagnosis [45], cardiac arrhythmia classification [11], and ground vehicle classification from acoustic features [51]. This success is mainly due to the intrinsic nature of FRBCs, which allows them to deal with vague and noisy data and to explain how the classification task is performed [29]. A number of techniques have been proposed to generate and optimize the structure of FRBCs, which consists of a rule base (RB) and a data base (DB) [15,26,38,40]. These approaches mainly aim to maximize the capability of the FRBC to correctly classify the input patterns, without taking into consideration how this maximization affects the FRBC interpretability in terms of RB complexity (i.e., a huge number of rules and conditions) and DB integrity (i.e., poor comprehensibility and possible
overlapping of the fuzzy sets in the partitions of the input linguistic variables). For this reason, these approaches have also been labeled as accuracy-oriented design methods. On the other hand, during the last decade, researchers have also focused their attention on the interpretability aspects of FRBSs in general and of FRBCs in particular [5,6,25,41,52]. Since accuracy and interpretability are conflicting objectives, the generation of the FRBS structure has been modeled as a multi-objective optimization problem. Multi-objective evolutionary algorithms (MOEAs) have been successfully employed to tackle this optimization problem with the main aim of generating sets of FRBSs characterized by different trade-offs between accuracy and interpretability. Indeed, the term multi-objective evolutionary fuzzy system (MOEFS) has been coined [21,22,27] to identify FRBSs generated by MOEAs. While the accuracy objective has typically been measured in terms of classification rate and approximation error for classification and regression problems, respectively, a number of specific measures have been proposed for evaluating the interpretability, taking the RB complexity and DB integrity into account [25]. A large number of contributions have recently been published under the framework of MOEFSs, with applications mostly to regression [1,2,7–10,13,14,17,24,43] and classification [3,19,20,33,36,37,42,46] problems. Recently, some taxonomies of the main contributions have also been introduced [21,22].

In this paper, we propose a fast and efficient MOEFS approach to concurrently perform rule base learning and membership function (MF) parameter learning of a set of FRBCs with different trade-offs between accuracy and interpretability. Rule base learning has mainly been performed by learning rules from scratch [1,7–9,14,20,36,42,43] or by selecting rules from an initial set of candidate rules [3,24,33,37]. We denote the two approaches as rule learning (RL) and rule selection (RS). Further, there also exist embedded approaches that evolve the DB during the evolutionary process and then exploit some heuristic for generating the RB whenever the objectives have to be evaluated during the evolutionary process [2,19].

RL schemes learn rules from scratch, usually using an integer chromosome to codify, for each condition in a rule, the index of the fuzzy set selected for the corresponding linguistic variable. Obviously, the dimension of the chromosome, and consequently of the search space, grows with the number of input variables. Thus, when this number is high, the MOEA needs a large number of evaluations to adequately explore the search space and achieve good solutions. On the other hand, RS schemes select rules from a set of candidate rules. They exploit a binary chromosome coding where each gene denotes whether a rule of this heuristically generated set is selected. A prescreening procedure is usually executed to remove overly complex rules. For example, in [37] the set of candidate rules is generated by using a specific heuristic based on multiple granularities: different partitions of the linguistic variables can be used for different rules. Further, a prescreening procedure is applied on the basis of the confidence and support of the generated rules.
Recently, the work discussed in [37] has been extended in [3]: the process of generating the set of candidate rules includes a computationally rather heavy method to find the specific granularity for each linguistic variable, and only rules with at most two or three conditions, depending on the number of features, are considered as candidate rules. During the evolutionary process, RS is performed together with the learning of the MF parameters.

Theoretically, the RL scheme can generate solutions with better trade-offs between accuracy and interpretability than the RS scheme, at the cost, however, of a higher number of evaluations. Indeed, the search space in the RL scheme is larger than in the RS scheme. Hence, although the RL scheme can potentially outperform the RS scheme in terms of trade-offs between accuracy and complexity, the designer is often obliged to limit the number of evaluations in order to obtain satisfactory solutions in a reasonable time.

In this paper, in order to exploit the potential of the RL scheme in the multi-objective evolutionary generation of FRBCs without suffering from its computational complexity, we learn the rules in a constrained search space. In particular, the learning process is performed by selecting a set of rules from the set of candidate rules and a set of conditions for each selected rule. We denote this hybrid scheme as rule and condition selection (RCS). To generate the set of candidate rules, we pre-process the training set by transforming each continuous variable into a categorical and ordered variable. To this aim, we exploit a pre-defined fuzzy partition of each input variable. Then, we apply the well-known C4.5 algorithm [44] to the transformed training set and generate a decision tree. Finally, we extract the set of candidate fuzzy rules from the decision tree: each rule corresponds to a path from the root to a leaf node. Since the C4.5 algorithm performs feature selection while creating the decision tree, we can obtain a quite simple set of candidate rules that is, however, a very good starting point for exploring the search space. During the multi-objective evolutionary process, we generate the RBs of the FRBCs by using the RCS approach and concurrently learn the MF parameters of the linguistic terms used in the rules. We measure accuracy and interpretability in terms of the percentage of correct classification and the total number of antecedent conditions of the rules in the RB, respectively.

Another MOEFS previously proposed in the literature exploits the C4.5 algorithm for generating a set of candidate rules [42]. Unlike our approach, however, the C4.5 algorithm is applied directly to the continuous input variables: the partitions are generated by the algorithm itself while creating the decision tree. The partitioning process is based only on the information gain and therefore might generate fuzzy sets that do not have real semantics. The overall set of candidate rules is used to generate the first solution; the rest of the population is created by replacing some parameters of this solution with random numbers. The NSGA-II algorithm is then applied to fine-tune the parameters of the fuzzy sets and to find the appropriate rules and rule conditions by using an RL scheme. Only the relevant variables, selected by the C4.5 algorithm, are used to form the fuzzy rules.
We have applied our approach to twenty-four classification datasets and the results have been compared with those obtained by two state-of-the-art multi-objective evolutionary approaches to FRBC generation [3]. We have evaluated both the effectiveness and quality of the evolutionary processes by using two well-known quality indicators, namely
hypervolume and epsilon-dominance, and the generalization capabilities of the generated solutions by employing the classification rate. Using non-parametric statistical tests, we have shown that our approach generates FRBCs with accuracy and total number of antecedent conditions statistically comparable to, and sometimes better than, those generated by the two MOEA-based approaches, while using only 5% of the fitness evaluations employed by the original versions discussed in [3]. Further, we have compared the performance of the FRBCs generated by our approach with those obtained by two well-known classification algorithms, namely FURIA [31] and C4.5 [44]. The classifiers generated by the FURIA and C4.5 algorithms are statistically equivalent, in terms of accuracy, to the most accurate FRBCs generated by the proposed approach. On the other hand, the classifiers generated by our approach are more interpretable than the ones generated by the FURIA and C4.5 algorithms.

The paper is organized as follows. In Section 2, we provide a basic description of FRBCs and introduce some notation. Section 3 presents the proposed MOEA-based learning approach, including the details of the generation of the set of candidate rules, of the chromosome coding and mating operators, and of the adopted MOEA. In Section 4, we illustrate the experimental results, and in Section 5 we draw some final conclusions.
2. Fuzzy rule-based classifiers
Pattern classification consists of assigning a class $C_j$, from a predefined set $\mathcal{C} = \{C_1, \ldots, C_K\}$ of classes, to a pattern. We consider a pattern as an $F$-dimensional point in a feature space $\mathbb{R}^F$. Let $X = \{X_1, \ldots, X_F\}$ be the set of input variables and $U_f$, $f = 1, \ldots, F$, be the universe of discourse of the $f$-th variable. Let $P_f = \{A_{f,1}, \ldots, A_{f,T_f}\}$, $f = 1, \ldots, F$, be a fuzzy partition with $T_f$ fuzzy sets of the universe $U_f$. The $m$-th rule $R_m$, $m = 1, \ldots, M$, of an FRBC is typically expressed as:
$$R_m:\ \text{If } X_1 \text{ is } A_{1,j_{m,1}} \text{ and } \ldots \text{ and } X_F \text{ is } A_{F,j_{m,F}} \text{ then } Y \text{ is } C_{j_m} \text{ with } RW_m \qquad (1)$$

where $Y$ is the classifier output, $C_{j_m} \in \mathcal{C}$ is the class label associated with the $m$-th rule, $j_{m,f} \in [1, T_f]$ identifies the index of the fuzzy set (among the $T_f$ fuzzy sets of the partition $P_f$) selected for $X_f$ in rule $R_m$, and $RW_m$ is the rule weight, i.e., a certainty degree of the classification in the class $C_{j_m}$ for a pattern which fires the antecedent of the rule. To take the "don't care" condition into account, a new fuzzy set $A_{f,0}$, $f = 1, \ldots, F$, is added to all the $F$ input partitions $P_f$. This fuzzy set is characterized by an MF equal to 1 on the whole universe. The terms $A_{f,0}$ allow generating rules that contain only a subset of the input variables.

Let $(\mathbf{x}_t, y_t)$ be the $t$-th input–output pair, with $\mathbf{x}_t = [x_{t,1}, \ldots, x_{t,F}] \in \mathbb{R}^F$ and $y_t \in \mathcal{C}$. The strength of activation (matching degree of the rule with the input) of the rule $R_m$ is calculated as $w_m(\mathbf{x}_t) = \prod_{f=1}^{F} A_{f,j_{m,f}}(x_{t,f})$, and the association degree with the class $C_{j_m}$ as $h_m(\mathbf{x}_t) = w_m(\mathbf{x}_t) \cdot RW_m$. We adopt the maximum matching method as reasoning method: an input pattern is classified into the class corresponding to the rule with the maximum association degree calculated for the pattern. In case of a tie, we classify the pattern with the class of the first rule in the chromosome that has the maximum association degree. In the last few years, different approaches have been proposed to calculate the value of the rule weight $RW_m$ [35]. In this paper, we adopt the certainty factor

$$RW_m = CF_m = \frac{\sum_{\mathbf{x}_t \in C_{j_m}} w_m(\mathbf{x}_t)}{\sum_{t=1}^{N} w_m(\mathbf{x}_t)} \qquad (2)$$
Once the number $T_f$ of fuzzy sets for each linguistic variable, the reasoning method and the rule weight type have been fixed, we adopt an MOEA-based approach to learn rules and MF parameters so as to generate a set of FRBCs with different trade-offs between accuracy and RB complexity.
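To make the reasoning method concrete, the following minimal Python sketch (our illustration, not code from the paper; the `Rule` class and the membership-function callback `mf` are hypothetical names) puts together the matching degree, the certainty-factor rule weight of (2) and the maximum-matching classification:

```python
import numpy as np

class Rule:
    """A fuzzy rule: antecedent fuzzy-set indices (0 = "don't care") and a class label."""
    def __init__(self, antecedent, class_label):
        self.antecedent = antecedent    # list of F fuzzy-set indices j_{m,f}
        self.class_label = class_label  # C_{j_m}
        self.weight = 0.0               # RW_m, set by fit_weight

def matching_degree(rule, x, mf):
    """w_m(x): product of the memberships A_{f,j}(x_f); index 0 contributes 1 (don't care)."""
    w = 1.0
    for f, j in enumerate(rule.antecedent):
        if j != 0:
            w *= mf(f, j, x[f])
    return w

def fit_weight(rule, X, y, mf):
    """Certainty factor CF_m of Eq. (2): matching mass of the rule's class over total mass."""
    w = np.array([matching_degree(rule, x, mf) for x in X])
    total = w.sum()
    rule.weight = w[np.asarray(y) == rule.class_label].sum() / total if total > 0 else 0.0

def classify(rules, x, mf):
    """Maximum matching: class of the rule with the highest association degree h_m(x);
    max() keeps the first rule in case of ties, as in the paper."""
    return max(rules, key=lambda r: matching_degree(r, x, mf) * r.weight).class_label
```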
3. The MOEA-based approach
The MOEA-based approach selects rules and conditions from a set of candidate rules by using the RCS method, and concurrently learns the MF parameters of the fuzzy sets used in the conditions of the rules. By using the RCS approach, we exploit the advantages of both the RL scheme (potentially better trade-offs between accuracy and complexity) and the RS scheme (reduced search space and therefore faster convergence). At the end of the execution of the MOEA, we have a set of FRBCs with different trade-offs between accuracy and RB complexity. To achieve this objective, we exploit an appropriate chromosome coding and properly defined mating operators. In particular, each chromosome $C$ is composed of two parts $(C_{RB}, C_{DB})$, which define the RB and the MF parameters of the input variables, respectively. We apply both crossover and mutation operators to each part of the chromosome independently. In the following, we introduce the method to generate the set of candidate rules and discuss the RCS approach and the MF parameter learning, highlighting the corresponding chromosome coding and mating operators used in the multi-objective evolutionary process. Then, we introduce the MOEA used to generate the FRBCs.
3.1. Generation of the set of candidate rules
The generation of the set of candidate rules is a critical aspect of the RCS strategy. Indeed, this strategy proves particularly effective and efficient when the set of candidate rules is compact and already characterized by a good accuracy. In this paper, we generate the set of candidate rules by first pre-processing the training set, transforming each continuous
variable into a categorical and ordered variable. To this aim, we exploit a fuzzy partition of each input variable. Then, we apply the well-known C4.5 algorithm, which generates a decision tree as output. Finally, we extract the set of candidate fuzzy rules from the decision tree. A similar approach was discussed in [16], with, however, a different aim, namely performing feature selection for FRBCs. The following procedure describes in detail the steps used to generate the set of candidate rules.

1. Define an initial uniform fuzzy partition $P_f = \{A_{f,1}, \ldots, A_{f,T_f}\}$ for each input variable $X_f$, $f = 1, \ldots, F$. The number $T_f$ of fuzzy sets can differ from one input variable to another. For the sake of simplicity, in our experiments we have used the same number of fuzzy sets for all the variables $X_f$. Our approach can, however, manage different $T_f$ for different variables $X_f$, if this information is available.
2. Use the defined fuzzy partitions $P_f$ to transform the continuous input variables $X_f$, defined in $\mathbb{R}$, into categorical and ordered variables: each category is a linguistic value corresponding to a fuzzy set in $P_f$. For simplicity, we will denote these linguistic values as $[1, \ldots, T_f]$ in the following. In a partition with five fuzzy sets, for example, possible linguistic values might be very low, low, medium, high and very high. For each input variable $X_f$, the category associated with each continuous value is determined by choosing the index of the fuzzy set of the partition $P_f$ to which the value belongs with maximum grade; in case of a tie, we choose randomly (a minimal code sketch of this step follows the list).
3. Apply the classical C4.5 algorithm to the transformed training set. The output of the algorithm is a decision tree. We recall that a decision tree is a flowchart-like tree structure, where each internal (non-leaf) node denotes a test on an input variable, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label. The topmost node in a tree is the root node. In our case, each branch is associated with one of the possible linguistic values of the input variable tested in the corresponding node.
4. Extract the set of candidate rules from the decision tree. One rule is created for each path from the root to a leaf node. The rule antecedent ("IF" part) is built by joining through the AND operator each splitting criterion along a given path. The leaf node holds the class prediction, forming the rule consequent ("THEN" part). Since each branch is identified by a linguistic value and an input variable can be tested in only one node in a path, the rules extracted from the decision tree are expressed as in (1).
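As an illustration of step 2, the sketch below fuzzifies a normalized continuous value against a uniform strong triangular partition (the partition form is taken from Section 3.3; the helper names are ours, and ties are broken deterministically here rather than randomly as in the paper):

```python
import numpy as np

def uniform_cores(T):
    """Cores b_1, ..., b_T of a uniform strong triangular partition of [0, 1]."""
    return np.linspace(0.0, 1.0, T)

def triangular_membership(x, cores, j):
    """Membership of x in the j-th (1-based) triangular fuzzy set of the partition."""
    b = cores[j - 1]
    left = cores[j - 2] if j > 1 else b          # left support extreme
    right = cores[j] if j < len(cores) else b    # right support extreme
    if x == b:
        return 1.0
    if x < b:
        return max(0.0, (x - left) / (b - left)) if left != b else 0.0
    return max(0.0, (right - x) / (right - b)) if right != b else 0.0

def fuzzify(x, T):
    """Step 2: map a continuous value in [0, 1] to the index (linguistic value)
    of the fuzzy set it belongs to with maximum grade."""
    cores = uniform_cores(T)
    grades = [triangular_membership(x, cores, j) for j in range(1, T + 1)]
    return int(np.argmax(grades)) + 1
```

For example, with $T_f = 5$, `fuzzify(0.23, 5)` returns 2, i.e. the second linguistic value (low) in the five-term partition.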
Fig. 1 shows an example of a decision tree generated by the C4.5 algorithm from a training set characterized by eight input variables and three classes (C1, C2, C3). Each input variable $X_f$, $f = 1, \ldots, 8$, has been partitioned with $T_f = 5$ fuzzy sets. We observe that only two out of the eight original input variables are included in the decision tree. This is due to the well-known ability of the C4.5 algorithm to select features during the generation of the tree. Fig. 2 shows the RB extracted from the decision tree of Fig. 1. We note that the RB consists of nine rules, which correspond to the nine possible paths from the root to the leaf nodes.
3.2. The rule and condition selection strategy
Let $J_{C4.5}$ and $M_{C4.5}$ be the RB generated by applying the C4.5 algorithm to the data set and the number of rules of this RB, respectively. Especially when dealing with large and high-dimensional datasets, the C4.5 algorithm can generate RBs composed of a high number of rules. For this reason, in order to generate compact and interpretable RBs, we allow the RB to contain at most $M_{MAX}$ rules. In the experiments, we have set $M_{MAX} = 50$: this value allows us to achieve a reasonable accuracy while keeping the complexity at an adequate level. Obviously, if the number $M_{C4.5}$ of rules extracted from the decision tree is lower than $M_{MAX}$, $M_{MAX}$ is set to $M_{C4.5}$. In fact, of the twenty-four datasets used in the experiments, only for five did we set $M_{MAX} = 50$; for all the other datasets, $M_{MAX} = M_{C4.5}$. Thus, the $C_{RB}$ part of the chromosome is a vector of $M_{MAX}$ pairs $p_m = (k_m, v_m)$, where $k_m \in [0, \ldots, M_{C4.5}]$ identifies the index of the rule in $J_{C4.5}$ selected for the current RB and $v_m = [v_{m,1}, \ldots, v_{m,F}]$ is a binary vector which indicates, for each condition in the rule, whether the condition is present or corresponds to a "don't care". In particular, if $k_m = 0$, the $m$-th rule is not included in the RB; thus, we can generate RBs with fewer rules than $M_{MAX}$. Further, if $v_{m,f} = 0$, the $f$-th condition of the $m$-th rule is replaced by a "don't care" condition; otherwise, it remains unchanged. We observe that the rule weight is re-computed whenever a condition selection is performed on the rule.
Fig. 1. An example of a decision tree generated by the C4.5 algorithm applied to the pre-processed training set.
Fig. 2. The fuzzy RB extracted from the decision tree shown in Fig. 1.
As an example, let us consider $M_{MAX} = 3$ and suppose we have a two-input fuzzy model. Let us assume that the C4.5 algorithm has generated the four rules described by the following matrix $J_{C4.5}$:

$$J_{C4.5} = \begin{bmatrix} 3 & 2 \\ 5 & 4 \\ 1 & 4 \\ 2 & 2 \end{bmatrix} \qquad (3)$$
Let us assume that, during the evolutionary process, the $C_{RB}$ chromosome part shown in Fig. 3 is generated. Then, the corresponding RB is represented by the following matrix $J$:

$$J = \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix} \qquad (4)$$
We note that, even though $M_{MAX} = 3$, only two rules have been selected for the final RB. Further, only the first and the second conditions have been selected for the first and the fourth rules, respectively.
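The decoding of a $C_{RB}$ part into an RB matrix can be sketched as follows (our hypothetical helper, consistent with the example of Eqs. (3) and (4); the content of Fig. 3 is inferred from the prose above):

```python
def decode_rb(c_rb, j_c45):
    """Decode the C_RB chromosome part into an RB matrix.

    c_rb:  list of M_MAX pairs (k_m, v_m); k_m in [0, M_C4.5] selects a candidate
           rule (0 = rule not included), v_m is a binary mask over the F conditions.
    j_c45: candidate rule matrix, one row of fuzzy-set indices per rule.
    """
    rb = []
    for k_m, v_m in c_rb:
        if k_m == 0:
            continue                       # rule not included in the RB
        rb.append([j if v == 1 else 0      # v = 0: condition becomes "don't care"
                   for j, v in zip(j_c45[k_m - 1], v_m)])
    return rb

j_c45 = [[3, 2], [5, 4], [1, 4], [2, 2]]         # Eq. (3)
c_rb = [(1, [1, 0]), (0, [1, 1]), (4, [0, 1])]   # a C_RB part as in Fig. 3
print(decode_rb(c_rb, j_c45))                    # [[3, 0], [0, 2]], i.e. Eq. (4)
```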
3.3. The MF parameter learning
In order to perform the MF parameter learning for each input linguistic variable, concurrently with the RCS, we exploit a real coding for the second part of the chromosome. We adopt triangular fuzzy sets $A_{f,j}$ defined by the tuple $(a_{f,j}, b_{f,j}, c_{f,j})$, where $a_{f,j}$ and $c_{f,j}$ correspond to the left and right extremes of the support of $A_{f,j}$, and $b_{f,j}$ to the core. Since we adopt strong fuzzy partitions with, for $j = 2, \ldots, T_f - 1$, $b_{f,j} = c_{f,j-1}$ and $b_{f,j} = a_{f,j+1}$, each fuzzy set of the partition is completely defined by fixing the positions of the cores $b_{f,j}$ along the universe $U_f$ of the variable (we normalize each variable in [0, 1]). Since $b_{f,1}$ and $b_{f,T_f}$ coincide with the extremes of the universe, the partition of each linguistic variable $X_f$ is completely defined by $T_f - 2$ parameters. Fig. 4 shows the $C_{DB}$ chromosome part, which consists of $F$ vectors of $T_f - 2$ real numbers: the $f$-th vector contains the cores $[b_{f,2}, \ldots, b_{f,T_f-1}]$ that define the positions of the MFs for the linguistic variable $X_f$. To ensure a good integrity level of the MFs, in terms of order, coverage and distinguishability, for all $j \in [2, T_f - 1]$ we force $b_{f,j}$ to vary in the definition interval $\left[b_{f,j} - \frac{b_{f,j} - b_{f,j-1}}{2},\ b_{f,j} + \frac{b_{f,j+1} - b_{f,j}}{2}\right]$.
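Under our reading of this constraint, the sketch below shows how the $T_f - 2$ evolved cores define the whole partition and how the definition interval of a core can be computed (helper names are ours):

```python
def partition_from_cores(inner_cores):
    """Build the (a, b, c) tuples of a strong triangular partition on [0, 1]
    from the T_f - 2 evolved inner cores of the C_DB part."""
    cores = [0.0] + list(inner_cores) + [1.0]      # b_1 and b_Tf are fixed extremes
    last = len(cores) - 1
    return [(cores[max(j - 1, 0)], cores[j], cores[min(j + 1, last)])
            for j in range(len(cores))]            # a = left core, c = right core

def definition_interval(cores, j):
    """Interval in which core b_j (0-based, interior) may vary: half the distance
    to each neighbouring core, preserving order, coverage and distinguishability."""
    lo = cores[j] - (cores[j] - cores[j - 1]) / 2.0
    hi = cores[j] + (cores[j + 1] - cores[j]) / 2.0
    return lo, hi
```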
3.4. The genetic operators
In order to generate the offspring populations, we exploit both crossover and mutation. We apply the two-point crossover to the $C_{RB}$ part and the BLX-$\alpha$ crossover, with $\alpha = 0.5$, to the $C_{DB}$ part. Let $s_1$ and $s_2$ be two selected parent chromosomes. The two crossover points are chosen by randomly extracting two numbers in $[1, \rho_{MAX} - 1]$, where $\rho_{MAX}$ is the maximum number
Fig. 3. An example of the $C_{RB}$ part of a chromosome.
Fig. 4. The $C_{DB}$ part of a chromosome.
of rules in $s_1$ and $s_2$. When we apply the two-point crossover to the RB part, we can generate an RB with one or more pairs of equal rules. In this case, we simply eliminate one rule of each pair by setting the corresponding $k_m$ to zero. This allows us to reduce the total number of rules. As regards mutation, for the $C_{RB}$ part we have applied two well-known operators, namely random mutation [28] and flip-flop mutation [48]. The first step of each operator randomly selects a pair $p_m$, i.e. a rule, in the chromosome. The random mutation operator replaces the value of $k_m$ in the selected pair with an integer value randomly generated in $[1, \ldots, M_{C4.5}]$. The flip-flop mutation operator modifies the antecedent $v_m$ of the selected rule by complementing each gene $v_{m,f}$ with probability $P_{cond}$ ($P_{cond} = 2/F$ in the experiments). After applying the two mutation operators, we remove the duplicate rules from the RB. We have applied the random mutation operator also to $C_{DB}$: the operator first randomly chooses an input variable $X_f$, $f \in [1, F]$, and a fuzzy set $j \in [2, T_f - 1]$, and then replaces the value of $b_{f,j}$ with a value randomly chosen within the definition interval of $b_{f,j}$.
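A compact sketch of the two mutation operators applied to $C_{RB}$ (assumed representation: `c_rb` is the list of pairs $(k_m, v_m)$ of Section 3.2; names are ours):

```python
import random

def random_mutation_rb(c_rb, m_c45):
    """Random mutation: re-draw the rule index k_m of a randomly selected pair."""
    m = random.randrange(len(c_rb))
    _, v_m = c_rb[m]
    c_rb[m] = (random.randint(1, m_c45), v_m)

def flip_flop_mutation_rb(c_rb, p_cond):
    """Flip-flop mutation: complement each antecedent gene v_{m,f} of a randomly
    selected rule with probability p_cond (p_cond = 2/F in the experiments)."""
    m = random.randrange(len(c_rb))
    k_m, v_m = c_rb[m]
    c_rb[m] = (k_m, [1 - v if random.random() < p_cond else v for v in v_m])
```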
3.5. The multi-objective evolutionary algorithm
As MOEA, we adopt the (2+2)M-PAES we proposed in [17]. Unlike the classical (2+2)PAES, which uses only mutation to generate new candidate solutions, (2+2)M-PAES exploits both crossover and mutation. Further, in (2+2)M-PAES, current solutions are randomly extracted at each iteration rather than maintained until they are replaced by solutions with particular characteristics. Fig. 5 shows the pseudo-code which describes the application scheme of the different operators to generate the offspring solutions $o_1$ and $o_2$ from the selected parents $s_1$ and $s_2$. Note that $P_{C_{RB}}$ and $P_{C_{DB}}$ represent the probabilities of applying the crossover operator to the $C_{RB}$ and $C_{DB}$ parts, respectively. $P_{M_{RB1}}$ and $P_{M_{RB2}}$ represent, respectively, the probabilities of applying the first and the second mutation operators to $C_{RB}$. Finally, $P_{M_{DB}}$ represents the probability of applying the mutation operator to $C_{DB}$. At the beginning, we generate two solutions $s_1$ and $s_2$. The genes of the $C_{DB}$ parts of $s_1$ and $s_2$ are randomly generated. As regards the $C_{RB}$ part, we randomly generate the $k_m$ values, while we initialize $v_m = 1$ for all the antecedents of all the rules. At each iteration, the application of crossover and mutation operators produces two new candidate solutions from the current solutions $s_1$ and $s_2$. These candidate solutions are added to the archive only if they are dominated by no solution contained in the archive; possible solutions in the archive dominated by the candidate solutions are removed. Typically, the size of the archive is fixed at the beginning of the execution of (2+2)M-PAES. In this case, when the archive is full and a new solution $z$ has to be added to the archive, if $z$ dominates no solution in the archive, then we insert $z$ into the archive and remove the solution (possibly $z$ itself) that belongs to the region with the highest crowding degree. If the region contains more than
Fig. 5. Application scheme of the genetic operators.
Table 1
Datasets used in the experiments (sorted by increasing number of input variables).

Dataset              # Instances   # Variables   # Classes
Haberman (HAB)       306           3             2
Hayes-roth (HAY)     160           3             3
Iris (IRI)           150           4             3
Mammographic (MAM)   830           5             2
Newthyroid (NEW)     215           5             3
Tae (TAE)            151           5             3
Bupa (BUP)           345           6             2
Appendicitis (APP)   106           7             2
Pima (PIM)           768           8             2
Glass (GLA)          214           9             6
Saheart (SAH)        462           9             2
Wisconsin (WIS)      683           9             2
Cleveland (CLE)      297           13            5
Heart (HEA)          270           13            2
Wine (WIN)           178           13            3
Australian (AUS)     690           14            2
Vehicle (VEH)        846           18            4
Bands (BAN)          365           19            2
Hepatitis (HEP)      80            19            2
Pasture (PAS)        36            21            3
Wdbc (WDB)           569           30            2
Dermatology (DER)    358           34            6
Ionosphere (ION)     351           34            2
Sonar (SON)          208           60            2
one solution, the solution to be removed is randomly chosen. (2+2)M-PAES determines an approximation of the optimal Pareto front by concurrently maximizing the accuracy and minimizing the complexity. The accuracy is calculated in terms of classification rate, i.e. the percentage of correctly classified patterns. The RB complexity is measured as the total number of conditions that compose the antecedents of the rules in the RB. In [34] this number is denoted as total rule length (TRL). Low values of TRL correspond to RBs characterized by a low number of rules and a low number of input variables actually used in each rule.
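The archive update of (2+2)M-PAES described above can be sketched as follows (a simplified illustration: the grid-based crowding of PAES is abstracted into a caller-supplied `crowding_degree` function, and the random tie-break within the most crowded region is omitted):

```python
def dominates(a, b):
    """Pareto dominance for objective tuples to be minimized,
    e.g. (classification error, TRL)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def try_insert(archive, z, capacity, crowding_degree):
    """Add candidate z if no archived solution dominates it; drop the solutions
    it dominates; when over capacity, remove a solution from the most crowded
    region (possibly z itself)."""
    if any(dominates(a, z) for a in archive):
        return
    archive[:] = [a for a in archive if not dominates(z, a)]
    archive.append(z)
    if len(archive) > capacity:
        archive.remove(max(archive, key=crowding_degree))
```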
4. Experimental results
We tested our method (denoted as PAES-RCS in the following) on twenty-four classification datasets extracted from the KEEL repository (available at http://sci2s.ugr.es/keel/datasets.php [4]). As shown in Table 1, the datasets are characterized by different numbers of input variables (from 3 to 60), input/output instances (from 80 to 846) and classes (from 2 to 6). For the datasets CLE, DER, HEP, MAM, and WIS, we removed the instances with missing values; the number of instances in the table refers to the datasets after this removal. For each dataset, we performed a ten-fold cross-validation and executed three trials for each fold with different seeds for the random number generator (30 trials in total). All the results presented in this section are obtained by using the same folds for all the algorithms. Table 2 shows the parameters of PAES-RCS used in the experiments. In the following, we first show the accuracy and complexity of the sets of candidate rules obtained by applying the C4.5 algorithm to the transformed training set. Second, we present a study on the number of fitness evaluations to be used as stopping criterion for PAES-RCS; we show that 50,000 fitness evaluations are sufficient to allow PAES-RCS to achieve Pareto front approximations statistically equivalent to the ones obtained after 1,000,000 fitness evaluations. Third, we analyze the Pareto front approximations generated by PAES-RCS in comparison with the ones obtained by the two state-of-the-art MOEFSs discussed in [3]. These MOEFSs apply NSGA-II to select rules from a set of candidate rules and concurrently tune the MF parameters. The comparison is performed taking into account both the effectiveness of the evolutionary process and the generalization capabilities of the generated solutions; the execution times of the three approaches are also discussed and compared. Fourth, we analyze how the RCS strategy and the appropriate generation of the set of candidate rules performed by the C4.5 algorithm both contribute to the good performance of PAES-RCS. Finally, we compare the FRBCs generated by our method with the classifiers generated by two state-of-the-art non-evolutionary learning algorithms, namely FURIA [31] and the classical C4.5 [44].
4.1. Generation of the set of candidate rules
To generate the set of candidate rules, we first pre-process the training set: each continuous variable is transformed into a categorical and ordered variable by using a uniform fuzzy partition of $T_f = 5$ fuzzy sets. Then, we execute the C4.5 algorithm implemented in WEKA [50] on the transformed training patterns. The value of $T_f$ has been heuristically determined by
274 275 276 277 278
282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298
302 303
1
Available at http://sci2s.ugr.es/keel/datasets.php, [4].
Q1 Please cite this article in press as: M. Antonelli et al., A fast and efficient multi-objective evolutionary learning scheme for fuzzy rule-based classifiers, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.06.014
INS 10940
No. of Pages 19, Model 3G
23 June 2014 Q1
8
M. Antonelli et al. / Information Sciences xxx (2014) xxx–xxx Table 2 Values of the parameters used in the experiments for PAES-RCS. AS Tf MMAX P CRB P CDB P MRB1 P MRB2 P MDB
304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324
(2 + 2)M-PAES archive size Number of fuzzy sets in each variable X f ; f ¼ 1; . . . ; F Maximum number of rules in an RB Probability of applying the crossover operator to C RB Probability of applying the crossover operator to C DB Probability of applying the first mutation operator to C RB Probability of applying the second mutation operator to C RB Probability of applying the mutation operator to C DB
32 5 minð50; MC4:5 Þ 0.1 0.5 0.1 0.7 0.2
performing several experiments using $T_f = 3$, $T_f = 5$ and $T_f = 7$: we observed that, on average, the solutions obtained using $T_f = 5$ are characterized by a higher accuracy. We set the confidence factor for pruning and the minimum number of instances per leaf to 0.25 and 1, respectively. Since different training sets might generate different decision trees, we have to calculate the set of candidate rules for each fold used in the cross-validation: thus, in total, we execute the C4.5 algorithm ten times. Since we are interested in sets of candidate rules that contain at least one rule for each class, for each fold we decide whether to extract these sets from the pre- or post-pruned decision tree: if the pruned decision tree does not have at least one rule for each class, we use the pre-pruned version of the tree; otherwise, we use the pruned version. In our experiments, we used the pre-pruned decision tree only in the case of the HAB and APP datasets.

Table 3 shows, for each dataset, the average accuracy of the sets of candidate rules calculated over the ten folds for both the training (AccTr) and test (AccTs) sets. The table also shows the average values of the total rule length (TRL), number of rules (Rules) and number of selected input variables (Var). We observe that the approach used for generating the set of candidate rules performs a feature selection, in particular for datasets with a high number of input variables. For example, for the SON and ION datasets, only 15 and 8.4 input variables out of 60 and 34, respectively, are used in the antecedents of the rules. Furthermore, the sets of candidate rules are characterized by a good accuracy on the training set. Indeed, for some datasets, such as APP, AUS, WIS and DER, this accuracy is very close to the accuracy of the FRBCs obtained by PAES-RCS (see Section 4.3). The main differences are in the values of complexity and, consequently, in the values of the accuracy on the test set. Indeed, the sets of candidate rules generated by the C4.5 algorithm applied to the transformed datasets are characterized by high values of TRL and number of rules. Thus, their generalization capability is very low; this can be observed mainly when the values of the TRL are very high (CLE and SON are two examples).

Table 3
Average accuracy on training and test sets, total rule length, number of rules and selected input variables for the candidate rule bases generated by the C4.5 algorithm.

Dataset   AccTr   AccTs   TRL      Rules   Var
HAB       74.87   72.86   95.8     37.0    2.8
HAY       81.25   75.00   71.0     29.0    3.1
IRI       64.29   64.66   14.4     9.0     2.0
MAM       78.65   78.03   69.8     27.4    4.2
NEW       92.82   92.60   54.3     21.4    5.0
TAE       51.59   43.83   100.1    35.0    4.9
BUP       59.74   58.18   45.3     19.0    3.5
APP       87.22   85.18   11.6     7.4     1.6
PIM       73.95   73.40   110.7    32.6    6.5
GLA       56.14   54.69   158.7    48.6    7.7
SAH       70.06   67.53   202.0    49.0    8.1
WIS       96.45   95.19   93.7     33.4    5.7
CLE       64.31   56.57   882.7    189.4   13.0
HEA       83.62   76.30   156.1    49.4    9.0
WIN       84.21   78.53   56.2     24.2    5.6
AUS       86.76   85.36   212.0    45.8    9.2
VEH       44.52   43.40   2210.9   337.4   17.9
BAN       62.58   58.58   360.5    78.2    12.9
HEP       91.63   82.42   36.7     15.0    3.5
PAS       73.43   58.33   32.0     15.4    3.5
WDB       79.28   79.42   208.0    44.2    10.4
DER       93.66   89.95   221.3    51.0    10.0
ION       87.15   85.78   140.8    37.8    8.4
SON       72.75   67.74   266.7    72.6    15.5
Mean      75.45   71.81   242.1    54.5    7.2
Std.      14.05   14.59   455.4    70.2    4.4
During the evolutionary process, PAES-RCS selects rules and conditions from these sets of candidate rules and concurrently learns the parameters of the MFs that define the partitions. This synergy produces two positive effects. On the one hand, the rule and condition selection strategy allows selecting the most effective candidate rules and generalizing these rules by reducing the number of antecedent conditions. On the other hand, the MF parameter learning permits adapting the partitions to the specific training set.
4.2. Determining the number of fitness evaluations
In order to determine the number of fitness evaluations to be used as stopping criterion, we execute PAES-RCS on all the datasets for 1,000,000 fitness evaluations and compare the classification accuracies on the test set by using the procedure adopted in our previous papers [1,8] and also in [24]. This procedure is based on the analysis of three representative solutions of the Pareto front approximations. In practice, for each of the thirty trials, we compute the Pareto front approximation of each algorithm and sort the solutions in each approximation by decreasing accuracy on the training set. Then, for each approximation, we select the first (the most accurate), the median and the last (the least accurate) solutions. We denote these solutions as FIRST, MEDIAN and LAST, respectively. Finally, for the three solutions, we compute the mean values over the 30 trials of the accuracy on the training and test sets, and of the TRL.

To determine the number of fitness evaluations, we compare the classification rates on the test set for the FIRST, MEDIAN and LAST solutions after 10,000 (PAES-RCS10), 50,000 (PAES-RCS50), 100,000 (PAES-RCS100) and 1,000,000 (PAES-RCS1m) fitness evaluations. To statistically validate the results, we apply non-parametric statistical tests for multiple comparisons by using all the datasets. First, for each version of PAES-RCS (i.e., PAES-RCS10, PAES-RCS50, PAES-RCS100 and PAES-RCS1m), we generate a distribution consisting of the mean values of the accuracies of the three solutions calculated on the test set. Then, we apply the Friedman test in order to compute a ranking among the distributions [23], and the Iman and Davenport test [32] to evaluate whether there exists a statistical difference among the distributions. If the Iman and Davenport p-value is lower than the level of significance $\alpha$ (in the experiments $\alpha = 0.05$), we can reject the null hypothesis and affirm that there exist statistical differences between the multiple distributions associated with each approach; otherwise, no statistical difference exists. If there exists a statistical difference, we apply a post hoc procedure, namely the Holm test [30]. This test allows detecting effective statistical differences between the control approach, i.e. the one with the lowest Friedman rank, and the remaining approaches.

From Table 4, we observe that PAES-RCS50 outperforms PAES-RCS10, and is statistically equivalent to both PAES-RCS100 and PAES-RCS1m on the FIRST solution. The MEDIAN and LAST solutions are statistically equivalent independently of the number of fitness evaluations. This result was expected because the LAST solutions are very simple and therefore their accuracy cannot be appreciably improved during the evolutionary process. On the other hand, although the accuracies of the MEDIAN solutions grow from 10,000 to 50,000 fitness evaluations, the increases are not statistically significant. Since PAES-RCS50 is statistically equivalent to PAES-RCS100 and PAES-RCS1m on all three solutions, we selected 50,000 fitness evaluations (the minimum number of evaluations which guarantees the same performance as 1 million evaluations) as stopping criterion for our experiments. For the sake of simplicity, we denote PAES-RCS50 as PAES-RCS in the following.
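For reference, the statistical pipeline used throughout this section (Friedman ranks, Iman and Davenport correction, Holm post hoc against the best-ranked control) can be sketched as follows; the input is a hypothetical matrix of mean accuracies, one row per dataset and one column per algorithm, and the helper names are ours:

```python
import numpy as np
from scipy import stats

def friedman_iman_davenport(results):
    """results: (N datasets x k algorithms) array of mean accuracies.
    Returns the average Friedman ranks and the Iman-Davenport p-value."""
    N, k = results.shape
    ranks = np.mean([stats.rankdata(-row) for row in results], axis=0)  # rank 1 = best
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(ranks ** 2) - k * (k + 1) ** 2 / 4.0)
    f_stat = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return ranks, stats.f.sf(f_stat, k - 1, (k - 1) * (N - 1))

def holm_vs_control(ranks, N, alpha=0.05):
    """Holm post hoc: compare every algorithm against the best-ranked control.
    Strictly, testing should stop at the first non-rejection; we report all steps."""
    k = len(ranks)
    control = int(np.argmin(ranks))
    se = np.sqrt(k * (k + 1) / (6.0 * N))
    others = [i for i in range(k) if i != control]
    p = {i: 2.0 * stats.norm.sf(abs(ranks[i] - ranks[control]) / se) for i in others}
    return [(i, p[i], alpha / (len(others) - step), p[i] < alpha / (len(others) - step))
            for step, i in enumerate(sorted(others, key=p.get))]
```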
4.3. Analysis of the Pareto front approximations in comparison with two state-of-the-art MOEFSs
In this section, we analyze the Pareto front approximations generated by PAES-RCS in comparison with the ones obtained by the two state-of-the-art MOEFSs discussed in [3]. The first step of these MOEFSs consists of generating, for each class, a set of short rules, i.e. rules with a reduced number of conditions in the antecedent part. In particular, the authors set the maximum number of antecedent conditions equal to three and two for datasets with fewer than, and with at least, thirty input variables, respectively. The specific rule generation heuristic is based on two well-known data mining rule evaluation measures, namely the confidence and the support, and uses multiple granularities: the same input variable can be partitioned by using a different number of fuzzy sets in different rules. The first approach (denoted as NSGAII-MG in the following) uses the multiple granularities also in the evolutionary rule selection step. The second approach (denoted as NSGAII-SG in the following) employs a mechanism to specify a single granularity for each input variable, according to the frequency of the employed partitions and the importance of the rules extracted with multiple granularities. Subsequently, a new set of short rules is generated, considering the single granularities, and the evolutionary rule selection is performed. We executed NSGAII-MG and NSGAII-SG using the same parameters as in [3] and, in particular, 1,000,000 fitness evaluations.

The comparison among the three algorithms is performed taking into account both the effectiveness of the evolutionary process and the generalization capabilities of the generated solutions. The former is evaluated by using two well-known metrics commonly used in the literature to assess the quality of Pareto front approximations, namely the hypervolume and the epsilon dominance [53]. For simplicity, the latter is measured by means of the classification accuracy of the FIRST solution calculated on the test set. Since the FIRST solution corresponds to the most accurate FRBC on the training set, it is also the solution most prone to suffer from overtraining. Thus, if the accuracy of this solution computed on the test set is high, it is likely that all the solutions on the Pareto front are characterized by good generalization capabilities.
Table 4
Results of the non-parametric statistical tests on the accuracy computed on the test set for the FIRST, MEDIAN and LAST solutions generated by PAES-RCS10, PAES-RCS50, PAES-RCS100 and PAES-RCS1m.

FIRST
Algorithm     Friedman rank   Iman and Davenport p-value   Hypothesis
PAES-RCS1m    2.6875          0.0160                       Rejected
PAES-RCS100   2.3542
PAES-RCS50    1.9167
PAES-RCS10    3.0417

Holm post hoc procedure
i   Algorithm     z-value   p-value   alpha/i   Hypothesis
3   PAES-RCS10    3.0187    0.0025    0.016     Rejected
2   PAES-RCS1m    2.7136    0.0386    0.025     Not rejected
1   PAES-RCS100   1.4012    0.2404    0.05      Not rejected

MEDIAN
Algorithm     Friedman rank   Iman and Davenport p-value   Hypothesis
PAES-RCS1m    2.6875          0.7787                       Not rejected
PAES-RCS100   2.4375
PAES-RCS50    2.5625
PAES-RCS10    2.3125

LAST
Algorithm     Friedman rank   Iman and Davenport p-value   Hypothesis
PAES-RCS1m    2.3750          0.2249                       Not rejected
PAES-RCS100   2.2292
PAES-RCS50    2.4375
PAES-RCS10    2.9583
Hypervolume and epsilon dominance provide a means to quantify the differences between Pareto front approximations by mapping each set of solutions to a real number. The hypervolume indicator measures the hypervolume of the portion of the objective space that is weakly dominated by a Pareto front approximation. In order to compute this indicator, the objective space must either be bounded or a bounding reference point, (at least weakly) dominated by all the solutions, must be defined. The epsilon dominance was proposed in [53] and makes direct use of the concept of Pareto dominance. Let us consider two Pareto front approximations A and B, where A dominates B. The epsilon dominance is a measure of the smallest distance one would need to translate every solution in B so that B dominates A. If C is chosen to be a reference Pareto front approximation, such that it dominates both A and B, then A and B can be directly compared with each other on the basis of the epsilon dominance with respect to C.

The computation of the two indicators was performed using the performance assessment package provided in the PISA toolkit [12]. First, the maximum values of accuracy and total rule length of the 90 Pareto front approximations (generated in the 30 trials executed for each of the three MOEAs) are computed in order to obtain the bounds for normalizing the objectives of each approximation in [0, 1]. Then, the objectives are normalized. The hypervolume is calculated by using (1, 1) as reference point. To compute the epsilon dominance, the reference Pareto front approximation consists of the non-dominated solutions among the solutions of the 90 Pareto front approximations. As a consequence of the normalization, the values of the two indicators lie in [0, 1].

To visually evaluate the performance of PAES-RCS with respect to the comparison algorithms, in Fig. 6 we plot, for all the datasets, the three representative solutions of PAES-RCS, NSGAII-MG and NSGAII-SG on the accuracy/TRL plane, on both the training and test sets. We note that, mainly for the datasets with a high number of input variables, the Pareto front approximations generated by PAES-RCS are wider than the ones generated by NSGAII-MG and NSGAII-SG, although PAES-RCS performs only 5% of the fitness evaluations of both NSGAII-MG and NSGAII-SG. Indeed, PAES-RCS executes only 50,000 fitness evaluations against the 1,000,000 fitness evaluations performed by these two MOEAs. This confirms the good exploration capability of PAES-RCS. On the other hand, the three representative solutions of PAES-RCS are in general more complex than the ones generated by NSGAII-MG and NSGAII-SG.

To quantitatively compare the Pareto front approximations, for each algorithm we have generated a distribution consisting of the mean values of the hypervolume and of the epsilon dominance by using all the datasets. In order to assess whether there exists a statistical difference among the indicators, and consequently among the Pareto front approximations, we apply the Friedman test and the Iman and Davenport test. In Table 5, we show the Friedman rank and the Iman and Davenport p-value for each algorithm. Since the p-values calculated for both indicators are larger than 0.05, there is no statistical difference among the three distributions for either indicator.
Thus, we conclude that, even if PAES-RCS executes only 5% of the fitness evaluations of both NSGAII-MG and NSGAII-SG, it is able to generate Pareto front approximations statistically equivalent to those generated by the other two algorithms.
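For two normalized minimization objectives, the two indicators can be sketched as follows (minimal implementations of our own, not the PISA code; the accuracy objective is assumed to be converted into an error so that both objectives are minimized, with (1, 1) as reference point):

```python
def hypervolume_2d(front, ref=(1.0, 1.0)):
    """Area weakly dominated by a 2-objective front (both minimized) up to ref."""
    hv, prev_y = 0.0, ref[1]
    for x, y in sorted(front):          # ascending first objective
        if y < prev_y:                  # keep the non-dominated staircase only
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

def additive_epsilon(A, B):
    """Smallest eps by which front A must be translated so that every point of B
    is dominated; evaluating each approximation against a common reference front
    C allows comparing two approximations, as done in the text."""
    return max(min(max(a1 - b1, a2 - b2) for a1, a2 in A) for b1, b2 in B)
```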
As regards the generalization capabilities of the generated solutions, Table 6 shows, for each dataset, the average values of the accuracy on the test set, of the TRL, of the number of rules and of the DB interpretability index GM3M for the FIRST solutions generated by NSGAII-MG and NSGAII-SG after 1,000,000 fitness evaluations, and for the FIRST solution of PAES-RCS after 50,000 fitness evaluations. For the sake of brevity, we do not show the accuracy on the training set. The GM3M index was proposed in [24] and is defined as the geometric mean of three metrics, namely the MF displacement $\delta$, the MF lateral amplitude rate $\gamma$ and the MF area similarity $\rho$. All three metrics contribute to measure how much the final partitions obtained by the evolutionary process differ from the original partitions. In particular, $\delta$ measures the proximity of the central points of the final MFs to the central points of the original MFs, while $\gamma$ and $\rho$ measure, respectively, the left/right rate differences and the area similarity of the final and original MFs. GM3M is formally defined as $GM3M = \sqrt[3]{\delta \cdot \gamma \cdot \rho}$. The maximum value of GM3M is 1 and corresponds to the highest level of DB interpretability.

For each dataset, we have shown in bold the values of the solutions with the highest classification rate, the lowest TRL, the lowest number of rules and the highest GM3M index. We can observe that the FIRST solutions generated by PAES-RCS are on average more complex than the ones generated by NSGAII-SG and NSGAII-MG, but are also more accurate. Furthermore, since the tuning performed by NSGAII-MG and NSGAII-SG applies just a constrained lateral displacement of the membership functions without changing their shapes, these two algorithms generate partitions characterized by a higher average value of GM3M. However, although our approach is characterized by a lower value of the GM3M index, ordering, distinguishability, coverage and normality are always guaranteed, thus preserving a good level of DB interpretability.

To statistically verify that the FIRST solutions generated by PAES-RCS are more accurate than the ones generated by both NSGAII-MG and NSGAII-SG, for each algorithm we consider a distribution consisting of the mean values of the accuracy of the FIRST solutions on the test set over all the datasets. In order to assess whether there exist statistical differences, we apply the Friedman test and the Iman and Davenport test. In Table 7 we show the Friedman rank and the Iman and Davenport p-value for each algorithm. We observe that the statistical hypothesis of equivalence is rejected. Thus, we apply the Holm post hoc procedure considering PAES-RCS as control algorithm (associated with the lowest rank, in bold in the table). Since the statistical hypothesis of equivalence is rejected in the case of the FIRST solution generated by NSGAII-MG, we can conclude that the FIRST solution generated by PAES-RCS is statistically more accurate than this solution. On the other hand, the statistical hypothesis of equivalence cannot be rejected for NSGAII-SG.

As regards the complexity, we apply the Friedman test and the Iman and Davenport test to the distributions consisting of the mean values of the TRLs of the FIRST solutions. In Table 8 we show the Friedman rank and the Iman and Davenport p-value for each algorithm. We observe that the statistical hypothesis of equivalence is rejected.
Thus, we apply the Holm post hoc procedure considering NSGAII-MG as control algorithm. We note that the FIRST solutions generated by PAES-RCS prove statistically more complex than the FIRST solutions generated by NSGAII-MG, and of the same complexity as the FIRST solutions generated by NSGAII-SG. On the other hand, as shown in Table 7, the FIRST solutions of PAES-RCS statistically outperform the FIRST solutions of NSGAII-MG in terms of classification rate and are statistically equivalent to the FIRST solutions of NSGAII-SG. Further, we will show in Section 4.4 that the execution time needed by PAES-RCS is much lower than the ones needed by NSGAII-MG and NSGAII-SG.

For the sake of completeness, we also compare the three algorithms after the same number of fitness evaluations, namely 50,000 and 1,000,000. These comparisons too are carried out taking into account both the effectiveness of the evolutionary process and the generalization capabilities of the generated solutions. As regards the effectiveness of the evolutionary process, we apply the Friedman test and the Iman and Davenport test to the distributions consisting of the mean values of the hypervolume and epsilon dominance computed on the Pareto front approximations generated by NSGAII-MG, NSGAII-SG and PAES-RCS using 50,000 and 1,000,000 fitness evaluations, respectively. In Tables 9 and 10 we show the Friedman rank and the Iman and Davenport p-value computed for each algorithm after 50,000 and 1,000,000 fitness evaluations, respectively. We observe that no statistical difference among the distributions exists.

To evaluate the difference in the generalization capabilities of the solutions generated by the three algorithms in relation to the number of fitness evaluations, we compare the classification accuracies achieved by the FIRST solutions on the test set. In Tables 11 and 12 we show the Friedman rank and the Iman and Davenport p-value computed on the classification rate of the FIRST solutions obtained by each algorithm after 50,000 and 1,000,000 fitness evaluations, respectively. We observe that the statistical hypothesis of equivalence is rejected. Thus, we apply the Holm post hoc procedure considering PAES-RCS as control algorithm (associated with the lowest rank, in bold in the tables). The null hypothesis is rejected in both cases. Thus, we can conclude that PAES-RCS generates FRBCs more accurate than those of the other two algorithms after both 50,000 and 1,000,000 fitness evaluations.
4.4. Analysis of the execution times of PAES-RCS, NSGAII-MG and NSGAII-SG
In Table 13, for all the three evolutionary approaches (PAES-RCS, NSGAII-MG and NSGAII-SG), we show the average times for generating the sets of candidate rules (Candidate rule generation time), the average total times for executing the complete approach (Total time), and the percentage of time needed for the generation of the set of candidate rules with respect to the total time needed by the complete approach (%). Times are expressed in seconds.
Fig. 6. Plots of the FIRST, MEDIAN and LAST solutions of PAES-RCS, NSGAII-MG and NSGAII-SG, on both the training and test sets, in the TRL-classification rate plane.
Table 5
Results of the non-parametric statistical tests on the hypervolume and epsilon dominance computed on the Pareto front approximations generated by NSGAII-MG, NSGAII-SG and PAES-RCS.

Hypervolume
Friedman ranks: NSGAII-MG 2.0141, NSGAII-SG 2.4583, PAES-RCS 1.5000 (control).
Iman and Davenport p-value: 0.0023 (hypothesis of equivalence rejected).
Holm post hoc procedure:
i  Algorithm   z-value  p-value  alpha/i  Hypothesis
2  NSGAII-SG   3.3197   0.0009   0.025    Rejected
1  NSGAII-MG   1.8763   0.0602   0.05     Not rejected

Epsilon dominance
Friedman ranks: NSGAII-MG 1.9091, NSGAII-SG 2.3636, PAES-RCS 1.7273.
Iman and Davenport p-value: 0.0919 (hypothesis of equivalence not rejected).
Table 6
Average accuracy on the test set (AccTs), TRL, number of rules and GM3M index of the FIRST solutions generated by NSGAII-MG, NSGAII-SG and PAES-RCS.

Dataset | NSGAII-MG: AccTs TRL Rules GM3M | NSGAII-SG: AccTs TRL Rules GM3M | PAES-RCS: AccTs TRL Rules GM3M
HAB     | 71.02 12.3  5.5 0.70 | 71.88  6.3  3.4 0.73 | 72.65 17.3 11.7 0.60
HAY     | 77.48 12.8  8.2 0.74 | 78.88 15.9 10.0 0.83 | 84.03 12.5  9.6 0.64
IRI     | 93.99 10.3  6.9 0.76 | 97.33  4.6  4.0 0.87 | 95.33  9.7  7.2 0.64
MAM     | 81.05 10.3  5.4 0.68 | 80.49 15.0  7.1 0.77 | 83.37 14.1  9.4 0.54
NEW     | 94.12  8.3  4.7 0.63 | 94.60 10.4  5.4 0.94 | 95.35 11.5  8.5 0.53
TAE     | 55.69 17.5  7.5 0.60 | 60.78 22.1  9.8 0.63 | 60.81 21.4 15.1 0.53
BUP     | 63.20 12.2  4.9 0.64 | 67.19 22.0  9.6 0.79 | 68.67 21.0 12.2 0.57
APP     | 87.00  3.4  2.3 0.66 | 87.30  6.3  3.1 0.76 | 85.09  6.3  5.6 0.69
PIM     | 75.97 12.8  5.7 0.61 | 77.05 11.0  5.2 0.58 | 74.66 19.9 13.6 0.52
GLA     | 64.31 25.7 10.5 0.65 | 71.28 34.8 15.4 0.82 | 72.13 28.7 17.0 0.49
SAH     | 70.70  9.6  4.6 0.66 | 70.13 16.3  6.8 0.58 | 70.92 28.7 18.3 0.51
WIS     | 95.82  9.0  5.6 0.68 | 96.35 11.0  6.3 0.64 | 96.46 21.3 15.4 0.51
CLE     | 54.53 49.0 17.7 0.58 | 58.80 52.5 19.4 0.68 | 59.06 50.0 22.8 0.50
HEA     | 82.96 11.9  6.5 0.62 | 82.84 18.8  8.8 0.67 | 83.21 21.0 14.3 0.53
WIN     | 94.14  7.1  3.9 0.84 | 93.03 10.2  5.8 0.92 | 93.98 15.4 11.1 0.52
AUS     | 85.60  5.5  2.2 0.57 | 85.65 12.2  5.1 0.71 | 85.80 24.6 14.3 0.57
VEH     | 63.91 26.9 10.8 0.62 | 66.16 32.0 11.9 0.61 | 64.89 37.3 14.9 0.44
BAN     | 68.82 12.9  4.7 0.62 | 65.80 15.7  6.6 0.63 | 67.56 36.0 21.0 0.48
HEP     | 90.38  7.7  3.1 0.55 | 88.53  6.8  3.0 0.59 | 83.21 21.0 14.3 0.67
PAS     | 82.77  5.0  3.3 0.75 | 76.94  6.1  3.8 0.86 | 76.84  8.4  7.6 0.60
WDB     | 95.49  7.7  4.9 0.64 | 94.90  8.6  5.1 0.61 | 95.14 16.7 11.0 0.51
DER     | 95.23 14.9  9.1 0.60 | 94.48 17.4 10.6 0.57 | 95.43 22.0 17.7 0.53
ION     | 88.63 10.0  6.7 0.56 | 90.79 10.7  8.1 0.64 | 90.40 33.7 19.9 0.53
SON     | 73.77  7.4  5.1 0.65 | 78.90  9.0  5.3 0.66 | 77.00 30.5 17.2 0.52
Mean    | 80.40 12.8  6.1 0.64 | 79.44 15.7  7.5 0.70 | 80.50 22.0 13.7 0.54
Std.    | 11.96  9.6  3.3 0.06 | 13.14 10.9  3.9 0.10 | 11.75 10.4  4.5 0.05
Thanks to the authors of [3], who provided us with the code of NSGAII-MG and NSGAII-SG, we have executed all the approaches on the same PC, equipped with an Intel Core i5 750 processor at 2.67 GHz, 4 GB of RAM and the Ubuntu operating system. As shown in Table 13, the percentage of the total execution time needed by PAES-RCS to generate the set of candidate rules is negligible and is quite independent of the number of variables in the datasets. Indeed, the generation of the decision trees proves to be very fast, since the C4.5 algorithm deals with categorical input variables. On the contrary, the percentage of the total execution time needed by both NSGAII-MG and NSGAII-SG to generate the candidate rules is quite high and increases with the number of variables, up to almost 50%. Further, the total execution time needed by PAES-RCS is on average much lower than the one needed by the other two approaches. This is due not only to the lower number of fitness evaluations performed by PAES-RCS, but also to the different times needed to generate the candidate rules.
4.5. Analysis of the synergy between the RCS strategy and the appropriate generation of the set of candidate rules
To verify whether both the RCS strategy and the appropriate generation of the set of candidate rules performed by the C4.5 algorithm contribute to the good performance of PAES-RCS, we have implemented two different versions of PAES-RCS. The first version, denoted as PAES-RCS-WM, exploits a set of candidate rules generated by an extension of the well-known Wang and Mendel algorithm proposed for classification problems [15]. The second version, denoted as PAES-RS, learns the rule base during the evolutionary process by only selecting rules rather than selecting both rules and conditions, as illustrated in the sketch below. By comparing the two versions with the proposed PAES-RCS, we can assess whether both the RCS strategy and the C4.5-based generation of the candidate rules contribute to the performance improvement.
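To clarify the difference between the two variants, the following sketch shows one plausible chromosome decoding for RCS. The encoding is our own illustration (names, sizes and the "0 = empty slot" convention are assumptions, not the exact representation used in the paper); PAES-RS corresponds to fixing the condition mask to all ones, so that entire rules are selected or discarded.

```python
import numpy as np

rng = np.random.default_rng(42)
M_CAND, M_MAX, F = 60, 12, 8   # candidate rules, max rules in an RB, features

# Hypothetical RCS chromosome: c_rules holds indices into the candidate rule
# set (0 = empty slot), c_conds masks single antecedent conditions.
c_rules = rng.integers(0, M_CAND + 1, size=M_MAX)
c_conds = rng.integers(0, 2, size=(M_MAX, F)).astype(bool)

def decode(c_rules, c_conds, candidates):
    """Build the RB: keep each selected rule with only its unmasked conditions."""
    rb = []
    for slot, idx in enumerate(c_rules):
        if idx == 0:
            continue                              # rule not selected
        antecedent = candidates[idx - 1]          # dict: feature -> fuzzy set
        reduced = {f: s for f, s in antecedent.items() if c_conds[slot, f]}
        if reduced:                               # drop rules left with no condition
            rb.append(reduced)
    return rb
```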
Table 7
Results of the non-parametric statistical tests on the accuracy computed on the test set among the most accurate solutions generated by NSGAII-MG, NSGAII-SG and PAES-RCS.

Friedman ranks: NSGAII-MG 2.2917, NSGAII-SG 2.1250, PAES-RCS 1.5833 (control).
Iman and Davenport p-value: 0.037 (hypothesis of equivalence rejected).

Holm post hoc procedure:
i  Algorithm   z-value  p-value  alpha/i  Hypothesis
2  NSGAII-MG   2.4537   0.0141   0.025    Rejected
1  NSGAII-SG   1.8763   0.060    0.05     Not rejected
Table 8
Results of the non-parametric statistical tests on the TRLs among the most accurate solutions generated by NSGAII-MG, NSGAII-SG and PAES-RCS.

Friedman ranks: NSGAII-MG 1.2083 (control), NSGAII-SG 2.1250, PAES-RCS 2.6667.
Iman and Davenport p-value: 1.476E-8 (hypothesis of equivalence rejected).

Holm post hoc procedure:
i  Algorithm   z-value  p-value   alpha/i  Hypothesis
2  PAES-RCS    5.0526   0.000004  0.025    Rejected
1  NSGAII-SG   3.1754   0.00154   0.05     Rejected

Pairwise Holm post hoc procedure:
Comparison                z-value  p-value   alpha/i  Hypothesis
NSGAII-MG vs. PAES-RCS    4.0526   0.000004  0.01666  Rejected
NSGAII-SG vs. NSGAII-MG   3.1754   0.001546  0.025    Rejected
NSGAII-SG vs. PAES-RCS    1.8763   0.0706    0.05     Not rejected
We compare PAES-RCS with PAES-RCS-WM and PAES-RS in terms of both the effectiveness of the evolutionary process, using the hypervolume and the epsilon dominance, and the generalization capabilities, using the accuracy on the test set of the FIRST solutions. The results of the comparison are statistically validated using the Wilcoxon signed-rank test, a non-parametric statistical test for detecting significant differences between two paired samples [49]. Table 14 shows the results of the statistical test on the hypervolume and epsilon dominance, and Table 15 the results on the accuracy of the FIRST solutions on the test set, for the comparisons between PAES-RCS and PAES-RCS-WM and between PAES-RCS and PAES-RS. In the tables, R+ and R- represent the sum of the ranks corresponding to PAES-RCS and to PAES-RCS-WM or PAES-RS, respectively.

As regards hypervolume and epsilon dominance, we observe that PAES-RCS and PAES-RCS-WM are statistically equivalent, while PAES-RCS outperforms PAES-RS. This attests that the effectiveness of the evolutionary approach is affected by the learning strategy, but not by the specific set of candidate rules. On the other hand, on the accuracy of the FIRST solution, PAES-RCS statistically outperforms both approaches. This indicates that the set of candidate rules affects the accuracy of the solutions generated by the evolutionary process. In particular, since PAES-RCS outperforms PAES-RCS-WM, we can deduce that the C4.5 algorithm is a good choice for generating the set of candidate rules. In conclusion, both the RCS strategy and the appropriate set of candidate rules generated by the C4.5 algorithm contribute to the good behavior of PAES-RCS.
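A sketch of this pairwise validation, assuming SciPy and two hypothetical arrays of per-dataset mean accuracies (the numbers below are random placeholders, not the paper's results):

```python
import numpy as np
from scipy import stats

# Paired mean accuracies over the 24 datasets (placeholder values).
rng = np.random.default_rng(1)
paes_rcs = rng.uniform(70, 95, size=24)
paes_rs = paes_rcs - rng.uniform(0, 4, size=24)   # a slightly worse variant

stat, p = stats.wilcoxon(paes_rcs, paes_rs)       # paired, non-parametric
print(f"W = {stat:.1f}, p = {p:.4g}, rejected at 0.05: {p < 0.05}")
```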
4.6. Comparison with two non-evolutionary classification algorithms
In this section, we compare the results achieved by PAES-RCS with the ones achieved by two state-of-the-art learning algorithms, namely FURIA and C4.5. Unlike PAES-RCS, NSGAII-MG and NSGAII-SG, neither FURIA nor C4.5 is based on evolutionary learning. The C4.5 algorithm [44] builds a decision-tree-based classifier by exploiting the concept of information entropy. At each node of the tree, C4.5 chooses the attribute of the training set that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain that results from choosing an attribute for splitting the data: the attribute with the highest normalized information gain is chosen to make the decision. FURIA (Fuzzy Unordered Rule Induction Algorithm) was introduced in [31] as an extension of the RIPPER algorithm [18]. The main extensions regard: (i) the use of fuzzy rather than crisp rules, (ii) the exploitation of unordered rather than ordered rule sets, and (iii) the introduction of a novel rule stretching method to manage uncovered examples. Detailed descriptions of FURIA and RIPPER can be found in [31,18], respectively.
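The splitting criterion can be made concrete with a few lines of Python. This is a textbook gain-ratio computation for a categorical attribute, not the actual WEKA/C4.5 implementation:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a label sequence."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(attr_values, labels):
    """Normalized information gain C4.5 uses to rank candidate splits."""
    n = len(labels)
    cond, split_info = 0.0, 0.0
    for v in set(attr_values):
        sub = [l for a, l in zip(attr_values, labels) if a == v]
        w = len(sub) / n
        cond += w * entropy(sub)
        split_info -= w * np.log2(w)
    gain = entropy(labels) - cond
    return gain / split_info if split_info > 0 else 0.0

# Toy example: splitting five samples on a three-valued attribute.
print(gain_ratio(["s", "s", "o", "r", "r"], ["n", "n", "y", "y", "n"]))
```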
Table 9
Results of the non-parametric statistical tests on the hypervolume and epsilon dominance computed on the Pareto front approximations generated by NSGAII-MG, NSGAII-SG and PAES-RCS after 50,000 fitness evaluations.

Hypervolume
Friedman ranks: NSGAII-MG 2.0417, NSGAII-SG 2.2083, PAES-RCS 1.7500.
Iman and Davenport p-value: 0.2801 (hypothesis of equivalence not rejected).

Epsilon dominance
Friedman ranks: NSGAII-MG 1.8750, NSGAII-SG 2.0833, PAES-RCS 2.0417.
Iman and Davenport p-value: 0.7548 (hypothesis of equivalence not rejected).
Table 10
Results of the non-parametric statistical tests on the hypervolume and epsilon dominance computed on the Pareto front approximations generated by NSGAII-MG, NSGAII-SG and PAES-RCS after 1,000,000 fitness evaluations.

Hypervolume
Friedman ranks: NSGAII-MG 2.2083, NSGAII-SG 2.2917, PAES-RCS 1.500.
Iman and Davenport p-value: 0.008 (hypothesis of equivalence not rejected).

Epsilon dominance
Friedman ranks: NSGAII-MG 1.9167, NSGAII-SG 2.1667, PAES-RCS 1.9167.
Iman and Davenport p-value: 0.6161 (hypothesis of equivalence not rejected).
Table 11
Results of the statistical tests on the accuracy calculated on the test set among the most accurate solutions of NSGAII-MG, NSGAII-SG and PAES-RCS, all executed using 50,000 fitness evaluations.

Friedman ranks: NSGAII-MG 2.3750, NSGAII-SG 2.1667, PAES-RCS 1.4583 (control).
Iman and Davenport p-value: 0.0023 (hypothesis of equivalence rejected).

Holm post hoc procedure:
i  Algorithm   z-value  p-value  alpha/i  Hypothesis
2  NSGAII-MG   3.1754   0.0015   0.025    Rejected
1  NSGAII-SG   2.4537   0.0141   0.05     Rejected
Table 12
Results of the statistical tests on the accuracy calculated on the test set among the most accurate solutions of NSGAII-MG, NSGAII-SG and PAES-RCS, all executed using 1,000,000 fitness evaluations.

Friedman ranks: NSGAII-MG 2.3333, NSGAII-SG 2.1250, PAES-RCS 1.5417 (control).
Iman and Davenport p-value: 0.0143 (hypothesis of equivalence rejected).

Holm post hoc procedure:
i  Algorithm   z-value  p-value  alpha/i  Hypothesis
2  NSGAII-MG   2.7424   0.0060   0.025    Rejected
1  NSGAII-SG   2.0207   0.043    0.05     Rejected
We have used the implementations of the FURIA and C4.5 algorithms provided in WEKA and have executed both algorithms with their default parameters (for FURIA, the default parameters coincide with those suggested in [31], where FURIA was introduced). Table 16 shows, for each dataset, the average values of the accuracy on the test set, of the TRL and of the number of rules for the solutions generated by the FURIA and C4.5 algorithms, and for the FIRST solutions of PAES-RCS. We can observe that the accuracies of the classifiers generated by the FURIA and C4.5 algorithms and of the FIRST solutions of PAES-RCS are, on average, comparable. As regards the complexity, we notice that the FURIA and PAES-RCS algorithms generate solutions that, on average, lie in the same complexity region. On the other hand, the solutions generated by the C4.5 algorithm are, on average, much more complex than the ones generated by the other methods. Further, by analyzing the results on the training set, which we have omitted for the sake of brevity, we have realized that the C4.5 algorithm often suffers from overfitting.

Table 17 shows the Friedman ranks and the Iman and Davenport p-values computed on the classification rates achieved by the classifiers generated by the FURIA and C4.5 algorithms and by the FIRST solutions of PAES-RCS. We observe that the statistical hypothesis of equivalence is not rejected. Thus, we can conclude that the classifiers generated by the FURIA and C4.5 algorithms are statistically equivalent, in terms of classification rate, to the FIRST solutions of PAES-RCS.

From the interpretability point of view, the solutions generated by PAES-RCS are preferable to those generated by the FURIA and C4.5 algorithms. First of all, the C4.5 algorithm generates classifiers with a large number of rules. Further, the rules generated by the FURIA and C4.5 algorithms cannot be expressed in terms of linguistic values with an appropriate meaning. As an example, we show in the following the typical structure of a fuzzy rule generated by FURIA:
$$R_m:\ \text{If } X_1 \text{ is } [a_{1,m}, b_{1,m}, c_{1,m}, d_{1,m}] \text{ and } \ldots \text{ and } X_F \text{ is } [a_{F,m}, b_{F,m}, c_{F,m}, d_{F,m}] \text{ Then } Y \text{ is } C_{j_m} \text{ with } RW_m \qquad (5)$$
Table 13
Average total time for executing the overall algorithm (Total), average time for generating the set of candidate rules (Cand.) and ratio between the two average times expressed as a percentage (%), for PAES-RCS, NSGAII-MG and NSGAII-SG. Times are in seconds.

Dataset | PAES-RCS: Total Cand. %   | NSGAII-MG: Total Cand. %    | NSGAII-SG: Total Cand. %
HAB     | 10.5  0.0258 0.25  | 247.5   0.040    0.02  | 32.7    0.040    0.12
HAY     | 6.7   0.0256 0.38  | 249.2   0.069    0.03  | 93.9    0.074    0.08
IRI     | 5.5   0.0216 0.39  | 254.0   0.080    0.03  | 110.6   0.082    0.07
MAM     | 25.4  0.0392 0.15  | 439.8   0.983    0.22  | 336.6   1.023    0.30
NEW     | 9.4   0.0265 0.28  | 326.5   0.280    0.09  | 134.8   0.268    0.20
TAE     | 9.3   0.0287 0.31  | 253.8   0.173    0.07  | 80.0    0.176    0.22
BUP     | 16.5  0.0387 0.23  | 309.0   0.858    0.28  | 261.1   0.855    0.33
APP     | 11.2  0.0239 0.21  | 244.1   0.533    0.22  | 154.4   0.524    0.34
PIM     | 31.0  0.0775 0.25  | 396.5   4.942    1.25  | 405.8   4.895    1.21
GLA     | 19.3  0.0435 0.23  | 676.3   2.293    0.34  | 609.3   2.398    0.39
SAH     | 32.1  0.0514 0.16  | 293.8   4.250    1.45  | 362.8   4.876    1.34
WIS     | 26.2  0.0433 0.17  | 348.1   6.143    1.76  | 371.7   6.346    1.71
CLE     | 34.7  0.0641 0.18  | 732.3   11.708   1.60  | 751.4   11.669   1.55
HEA     | 17.7  0.0384 0.22  | 263.0   8.688    3.30  | 277.9   8.830    3.18
WIN     | 14.5  0.0269 0.19  | 285.4   6.225    2.18  | 345.0   7.179    2.08
AUS     | 39.6  0.0725 0.18  | 502.1   32.283   6.43  | 511.6   31.794   6.21
VEH     | 73.0  0.1258 0.17  | 966.7   80.728   8.35  | 1172.3  91.850   7.84
BAN     | 33.4  0.0722 0.22  | 436.3   51.914   11.90 | 451.0   48.861   10.83
HEP     | 14.4  0.0256 0.18  | 269.8   10.985   4.07  | 263.9   11.229   4.26
PAS     | 7.2   0.0215 0.29  | 247.2   9.279    3.75  | 247.8   9.458    3.81
WDB     | 26.8  0.0688 0.26  | 980.4   315.078  32.14 | 1222.6  381.407  31.20
DER     | 24.0  0.0742 0.31  | 1676.3  403.927  24.10 | 1739.4  412.215  23.70
ION     | 39.3  0.0706 0.18  | 1061.0  350.094  33.00 | 1109.6  354.437  31.94
SON     | 25.4  0.0686 0.27  | 3252.9  1523.019 46.82 | 3651.6  1669.156 45.71
Mean    | 23.0  0.05   0.23  | 613.0   117.69   7.64  | 612.41  127.49   7.44
Std.    | 15.02 0.02   0.066 | 665.51  321.8    12.8  | 776.6   351.9    12.5
Table 14
Results of the Wilcoxon signed-rank test on the hypervolume and epsilon dominance indicators for PAES-RCS vs PAES-RCS-WM and for PAES-RCS vs PAES-RS.

Hypervolume
Comparison                | R+  | R-  | Hypothesis (alpha = 0.05) | p-value
PAES-RCS vs PAES-RCS-WM   | 195 | 105 | Not rejected              | 0.2
PAES-RCS vs PAES-RS       | 300 | 0   | Rejected                  | 1.19E-7

Epsilon dominance
Comparison                | R+  | R-  | Hypothesis (alpha = 0.05) | p-value
PAES-RCS vs PAES-RCS-WM   | 187 | 113 | Not rejected              | 0.20
PAES-RCS vs PAES-RS       | 298 | 2   | Rejected                  | 3.57E-07
Table 15
Results of the Wilcoxon signed-rank test on the accuracy of the FIRST solutions on the test set for PAES-RCS vs PAES-RCS-WM and for PAES-RCS vs PAES-RS.

Comparison                | R+  | R- | Hypothesis (alpha = 0.05) | p-value
PAES-RCS vs PAES-RCS-WM   | 255 | 45 | Rejected                  | 0.0017
PAES-RCS vs PAES-RS       | 287 | 13 | Rejected                  | 1.049E-5
where $a_{f,m}$, $d_{f,m}$ and $b_{f,m}$, $c_{f,m}$ identify, respectively, the extremes of the support and of the core of the specific trapezoidal fuzzy set used in rule $R_m$ for the condition on variable $X_f$. Thus, even though FURIA generates compact fuzzy rule bases, the fuzzy sets that appear in the antecedent of each rule are not associated with linguistic fuzzy partitions of the input variables and are therefore hardly interpretable [25].
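For reference, a trapezoidal fuzzy set $[a, b, c, d]$ as used in (5) can be evaluated with a few lines of Python (our own helper, assuming a < b <= c < d):

```python
def trapezoid_mf(x, a, b, c, d):
    """Membership degree of x in the trapezoidal fuzzy set [a, b, c, d]:
    support [a, d], core [b, c]; assumes a < b <= c < d."""
    if b <= x <= c:
        return 1.0                              # inside the core
    if x <= a or x >= d:
        return 0.0                              # outside the support
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

print(trapezoid_mf(2.5, 1.0, 2.0, 3.0, 4.0))    # 1.0, inside the core
print(trapezoid_mf(3.5, 1.0, 2.0, 3.0, 4.0))    # 0.5, on the right slope
```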
5. Conclusions
In this paper, we have proposed a fast and effective approach for designing fuzzy rule-based classifiers. In order to generate sets of classifiers characterized by different trade-offs between accuracy and complexity, we have exploited a multi-objective evolutionary RB learning scheme, denoted as rule and condition selection (RCS), and have concurrently learnt the membership function parameters during the same evolutionary optimization process.
Table 16
Average accuracy on the test set (AccTs), TRL and number of rules for the solutions generated by the FURIA and C4.5 algorithms, and for the FIRST solutions of PAES-RCS.

Dataset | FURIA: AccTs TRL Rules | C4.5: AccTs TRL Rules | PAES-RCS: AccTs TRL Rules
HAB     | 75.44  5.8  3.8 | 71.56   5.0  3.0 | 72.65 17.3 11.7
HAY     | 83.13 20.9  9.0 | 83.12  65.0 12.0 | 84.03 12.5  9.6
IRI     | 94.66  7.1  4.5 | 95.2    8.8  5.2 | 95.33 11.5  8.5
MAM     | 83.89  5.1  3.3 | 83.97  14.0  5.0 | 83.37 14.1  9.4
NEW     | 96.30 14.2  7.1 | 92.09  38.0  9.0 | 95.35 11.5  8.5
TAE     | 43.08  8.6  5.0 | 59.60 230.0 34.0 | 60.81 21.4 15.1
BUP     | 69.02 29.2 11.1 | 67.82 142.0 26.0 | 68.67 21.0 12.2
APP     | 85.18  5.0  3.8 | 85.84   5.0  3.0 | 85.09  6.3  5.6
PIM     | 74.62 17.0  7.5 | 74.67 111.0 20.0 | 74.66 19.9 13.6
GLA     | 72.41 35.7 13.3 | 69.15 187.0 30.0 | 72.13 28.7 17.0
SAH     | 69.69  9.6  5.3 | 70.77  74.0 15.0 | 70.92 28.7 18.3
WIS     | 96.35 38.6 13.5 | 95.60  42.0 11.0 | 96.46 21.3 15.4
CLE     | 56.20 20.1  6.7 | 48.48 303.0 46.0 | 59.06 50.0 22.8
HEA     | 80.00 26.6  9.4 | 77.40 104.0 20.0 | 83.21 21.0 14.3
WIN     | 96.60 12.5  6.4 | 93.82  12.0  5.0 | 93.98 15.4 11.1
AUS     | 85.22 11.2  8.0 | 84.05 189.0 31.0 | 85.80 24.6 14.3
VEH     | 71.52 84.7 25.1 | 75.57 918.0 98.0 | 64.89 37.3 14.9
BAN     | 64.65 38.5 13.9 | 63.28 312.0 34.0 | 67.56 36.0 21.0
HEP     | 84.52  9.7  5.4 | 86.25  27.0  8.0 | 83.21 21.0 14.3
PAS     | 80.55  6.4  4.2 | 75.0    7.2  3.8 | 76.84  8.4  7.6
WDB     | 96.31 30.7 11.6 | 94.02  49.0 12.0 | 95.14 16.7 11.0
DER     | 95.24 28.4 10.7 | 95.25  35.0  8.0 | 95.43 22.0 17.7
ION     | 91.75 30.8 12.1 | 90.59 110.0 17.0 | 90.40 33.7 19.9
SON     | 82.14 28.7 10.8 | 72.11  95.0 19.0 | 77.00 30.5 17.2
Mean    | 80.35 21.9  8.8 | 79.38 128.4 19.8 | 80.50 22.0 13.7
Std.    | 13.72 17.5  4.8 | 12.70 191.3 20.4 | 11.75 10.4  4.5
Table 17
Results of the statistical tests on the accuracy obtained on the test set by the classifiers generated by the FURIA and C4.5 algorithms and by the FIRST solutions of PAES-RCS.

Friedman ranks: FURIA 1.8333, C4.5 2.375, PAES-RCS 1.7917.
Iman and Davenport p-value: 0.076 (hypothesis of equivalence not rejected).
In particular, the RCS strategy chooses a reduced number of rules from a set of candidate rules and a reduced number of conditions for each selected rule. To generate a set of candidate rules that represents a very good starting point for exploring the search space, we have exploited the well-known C4.5 algorithm. First, the input training patterns are transformed into ordered categories by using an initial fuzzy partition of each input variable. Then, the set of candidate rules is extracted from the decision tree generated by the C4.5 algorithm.

The proposed approach has been tested on twenty-four classification benchmarks. We have first performed a study on the number of fitness evaluations to be used as stopping criterion. To this aim, we have compared, in terms of classification accuracy on the test set, the solutions obtained by our approach after 10,000, 50,000 and 1,000,000 fitness evaluations. We have pointed out that 50,000 fitness evaluations are sufficient to obtain Pareto front approximations statistically equivalent to the ones obtained after 1,000,000 fitness evaluations.

Then, we have compared the results of our approach after 50,000 fitness evaluations with the ones achieved by two state-of-the-art multi-objective evolutionary fuzzy classifiers, namely NSGAII-MG and NSGAII-SG, based on a classical rule selection scheme. This comparison has been performed by considering two aspects: the effectiveness of the evolutionary process and the generalization capabilities of the generated solutions. In particular, we have used two well-known indicators, namely the epsilon dominance and the hypervolume, to evaluate the first aspect, and the accuracy on the test set of the most accurate solution to evaluate the second aspect. As regards the first aspect, we have statistically verified that the solutions generated by our approach are equivalent to the ones generated by NSGAII-MG and NSGAII-SG, even if our approach executes only 5% of the fitness evaluations employed by NSGAII-MG and NSGAII-SG. Furthermore, in most of the datasets, the most accurate solutions generated by our approach outperform the ones generated by NSGAII-MG and NSGAII-SG, although our solutions are characterized by a higher complexity.
We have also compared the three algorithms in terms of execution times and have shown that the total execution time needed by our approach is much lower than the ones needed by both NSGAII-MG and NSGAII-SG. Finally, we have compared the fuzzy rule-based classifiers generated by our approach with the classifiers generated by two well-known non-evolutionary learning algorithms, namely FURIA and C4.5, pointing out that the results achieved by our classifiers are statistically equivalent, in terms of accuracy, to the ones achieved by the FURIA and C4.5 algorithms. On the other hand, from the semantic point of view, the linguistic fuzzy rule-based classifiers generated by our approach are more interpretable than the classifiers generated by the FURIA and C4.5 algorithms.
Acknowledgements
We would like to express our sincere gratitude to R. Alcalá, Y. Nojima, F. Herrera and H. Ishibuchi, authors of the paper [3], for providing us with the code of the NSGAII-MG and NSGAII-SG algorithms, thereby allowing us to perform the comparisons discussed in Section 4.
References
[1] R. Alcalá, P. Ducange, F. Herrera, B. Lazzerini, F. Marcelloni, A multiobjective evolutionary approach to concurrently learn rule and data bases of linguistic fuzzy-rule-based systems, IEEE Trans. Fuzzy Syst. 17 (2009) 1106-1122.
[2] R. Alcalá, M.J. Gacto, F. Herrera, A fast and scalable multiobjective genetic fuzzy system for linguistic fuzzy modeling in high-dimensional regression problems, IEEE Trans. Fuzzy Syst. 19 (2011) 666-681.
[3] R. Alcalá, Y. Nojima, F. Herrera, H. Ishibuchi, Multiobjective genetic fuzzy rule selection of single granularity-based fuzzy classification rules and its interaction with the lateral tuning of membership functions, Soft Comput. 15 (2011) 2303-2318.
[4] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, Multiple-Valued Logic Soft Comput. 17 (2011) 255-287.
[5] J.M. Alonso, L. Magdalena, Editorial: special issue on interpretable fuzzy systems, Inform. Sci. 181 (2011) 4331-4339.
[6] J.M. Alonso, L. Magdalena, G. González-Rodríguez, Looking for a good fuzzy system interpretability index: an experimental approach, Int. J. Approx. Reason. 51 (2009) 115-134.
[7] M. Antonelli, P. Ducange, B. Lazzerini, F. Marcelloni, Multi-objective evolutionary learning of granularity, membership function parameters and rules of Mamdani fuzzy systems, Evolution. Intell. 2 (2009) 21-37.
[8] M. Antonelli, P. Ducange, B. Lazzerini, F. Marcelloni, Learning concurrently data and rule bases of Mamdani fuzzy rule-based systems by exploiting a novel interpretability index, Soft Comput. 15 (2011) 1981-1998.
[9] M. Antonelli, P. Ducange, B. Lazzerini, F. Marcelloni, Learning knowledge bases of multi-objective evolutionary fuzzy systems by simultaneously optimizing accuracy, complexity and partition integrity, Soft Comput. 15 (2011) 2335-2354.
[10] M. Antonelli, P. Ducange, F. Marcelloni, Genetic training instance selection in multi-objective evolutionary fuzzy systems: a co-evolutionary approach, IEEE Trans. Fuzzy Syst. 20 (2012) 276-290.
[11] B. Anuradha, V. Reddy, Cardiac arrhythmia classification using fuzzy classifiers, J. Theor. Appl. Inform. Technol. 4 (2008) 353-359.
[12] S. Bleuler, M. Laumanns, L. Thiele, E. Zitzler, PISA—a platform and programming language independent interface for search algorithms, in: C.M. Fonseca et al. (Eds.), Conference on Evolutionary Multi-Criterion Optimization (EMO 2003), LNCS, vol. 2632, Springer, Berlin, 2003, pp. 494-508.
[13] A. Botta, B. Lazzerini, F. Marcelloni, D.C. Stefanescu, Context adaptation of fuzzy systems through a multi-objective evolutionary approach based on a novel interpretability index, Soft Comput. 13 (2009) 437-449.
[14] J. Casillas, P. Martínez, A. Benítez, Learning consistent, complete and compact sets of fuzzy rules in conjunctive normal form for regression problems, Soft Comput. 13 (2009) 451-465.
[15] Z. Chi, H. Yan, T. Pham, Fuzzy Algorithms with Applications to Image Processing and Pattern Recognition, World Scientific, Singapore; River Edge, NJ, 1996.
[16] M.E. Cintra, H.A. Camargo, Feature subset selection for fuzzy classification methods, in: E. Hullermeier, R. Kruse, F. Hoffmann (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems: Theory and Methods, Communications in Computer and Information Science, vol. 80, Springer, Berlin, Heidelberg, 2010, pp. 318-327.
[17] M. Cococcioni, P. Ducange, B. Lazzerini, F. Marcelloni, A Pareto-based multi-objective evolutionary approach to the identification of Mamdani fuzzy systems, Soft Comput. 11 (2007) 1013-1031.
[18] W.W. Cohen, Fast effective rule induction, in: Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann, 1995, pp. 115-123.
[19] O. Cordón, M.J. Jesus, F. Herrera, L. Magdalena, P. Villar, A multiobjective genetic learning process for joint feature selection and granularity and contexts learning in fuzzy rule-based classification systems, in: J. Casillas, O. Cordón, F. Herrera, L. Magdalena (Eds.), Interpretability Issues in Fuzzy Modeling, Studies in Fuzziness and Soft Computing, vol. 128, Springer, Berlin, Heidelberg, 2003, pp. 79-99.
[20] P. Ducange, B. Lazzerini, F. Marcelloni, Multi-objective genetic fuzzy classifiers for imbalanced and cost-sensitive datasets, Soft Comput. 14 (2010) 713-728.
[21] P. Ducange, F. Marcelloni, Multi-objective evolutionary fuzzy systems, in: A. Fanelli, W. Pedrycz, A. Petrosino (Eds.), Fuzzy Logic and Applications, Lecture Notes in Computer Science, vol. 6857, Springer, Berlin, Heidelberg, 2011, pp. 83-90.
[22] M. Fazzolari, R. Alcalá, Y. Nojima, H. Ishibuchi, F. Herrera, A review of the application of multi-objective evolutionary fuzzy systems: current status and further directions, IEEE Trans. Fuzzy Syst. 21 (2013) 45-65.
[23] M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, J. Am. Stat. Assoc. 32 (1937) 675-701.
[24] M.J. Gacto, R. Alcalá, F. Herrera, Integration of an index to preserve the semantic interpretability in the multiobjective evolutionary rule selection and tuning of linguistic fuzzy systems, IEEE Trans. Fuzzy Syst. 18 (2010) 515-531.
[25] M.J. Gacto, R. Alcalá, F. Herrera, Interpretability of linguistic fuzzy rule-based systems: an overview of interpretability measures, Inform. Sci. 181 (2011) 4340-4360.
[26] A. Gonzalez, R. Perez, SLAVE: a genetic learning system based on an iterative approach, IEEE Trans. Fuzzy Syst. 7 (1999) 176-191.
[27] F. Herrera, Genetic fuzzy systems: taxonomy, current research trends and prospects, Evolution. Intell. 1 (2008) 27-46.
[28] F. Herrera, M. Lozano, J.L. Verdegay, Tackling real-coded genetic algorithms: operators and tools for behavioural analysis, Artif. Intell. Rev. 12 (1998) 265-319.
[29] S.Y. Ho, H.M. Chen, S.J. Ho, T.K. Chen, Design of accurate classifiers with a compact fuzzy-rule base using an evolutionary scatter partition of feature space, IEEE Trans. Syst. Man Cybernet. Part B 34 (2004) 1031-1044.
[30] S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian J. Stat. 6 (1979) 65-70.
[31] J. Huhn, E. Hullermeier, FURIA: an algorithm for unordered fuzzy rule induction, Data Min. Knowl. Discov. 19 (2009) 293-319.
[32] R.L. Iman, J.H. Davenport, Approximations of the critical region of the Friedman statistic, Commun. Stat. - Theory Methods Part A 9 (1980) 571-595.
[33] H. Ishibuchi, T. Murata, I.B. Türkşen, Single-objective and two-objective genetic algorithms for selecting linguistic rules for pattern classification problems, Fuzzy Sets Syst. 89 (1997) 135-150.
[34] H. Ishibuchi, T. Nakashima, T. Murata, Three-objective genetics-based machine learning for linguistic rule extraction, Inform. Sci. 136 (2001) 109-133.
[35] H. Ishibuchi, T. Nakashima, M. Nii, Classification and Modeling with Linguistic Information Granules: Advanced Approaches to Linguistic Data Mining (Advanced Information Processing), Springer Verlag New York, Inc., Secaucus, NJ, USA, 2004.
[36] H. Ishibuchi, Y. Nojima, Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning, Int. J. Approx. Reason. 44 (2007) 4-31.
[37] H. Ishibuchi, T. Yamamoto, Fuzzy rule selection by multi-objective genetic local search algorithms and rule evaluation measures in data mining, Fuzzy Sets Syst. 141 (2004) 59-88.
[38] H. Ishibuchi, T. Yamamoto, T. Nakashima, Hybridization of fuzzy GBML approaches for pattern classification problems, IEEE Trans. Syst. Man Cybernet. Part B 35 (2005) 359-365.
[39] L.I. Kuncheva, Fuzzy Classifier Design, Studies in Fuzziness and Soft Computing, vol. 49, Springer, 2000.
[40] E. Mansoori, M. Zolghadri, S. Katebi, SGERD: a steady-state genetic algorithm for extracting fuzzy classification rules from data, IEEE Trans. Fuzzy Syst. 16 (2008) 1061-1071.
[41] C. Mencar, A.M. Fanelli, Interpretability constraints for fuzzy information granulation, Inform. Sci. 178 (2008) 4585-4618.
[42] P. Pulkkinen, H. Koivisto, Fuzzy classifier identification using decision tree and multiobjective evolutionary algorithms, Int. J. Approx. Reason. 48 (2008) 526-543.
[43] P. Pulkkinen, H. Koivisto, A dynamically constrained multiobjective genetic fuzzy system for regression problems, IEEE Trans. Fuzzy Syst. 18 (2010) 161-177.
[44] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
[45] G. Schaefer, M. Závišek, T. Nakashima, Thermography based breast cancer analysis using statistical features and fuzzy classification, Pattern Recogn. 42 (2009) 1133-1137.
[46] K. Trawinski, O. Cordón, A. Quirin, A study on the use of multiobjective genetic algorithms for classifier selection in FURIA-based fuzzy multiclassifiers, Int. J. Comput. Intell. Syst. 5 (2012) 231-253.
[47] C.H. Tsang, S. Kwong, H. Wang, Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection, Pattern Recogn. 40 (2007) 2373-2391.
[48] D. Whitley, A genetic algorithm tutorial, Stat. Comput. 4 (1994) 65-85.
[49] F. Wilcoxon, Individual comparisons by ranking methods, Biometr. Bull. 1 (1945) 80-83.
[50] I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed., Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, 2011.
[51] H. Wu, J.M. Mendel, Classification of battlefield ground vehicles using acoustic features and fuzzy logic rule-based classifiers, IEEE Trans. Fuzzy Syst. 15 (2007) 56-72.
[52] S.M. Zhou, J.Q. Gan, Low-level interpretability and high-level interpretability: a unified view of data-driven interpretable fuzzy system modelling, Fuzzy Sets Syst. 159 (2008) 3091-3131.
[53] E. Zitzler, L. Thiele, M. Laumanns, C.M. Fonseca, V.G. da Fonseca, Performance assessment of multiobjective optimizers: an analysis and review, IEEE Trans. Evolution. Comput. 7 (2002) 117-132.