Symbolic interpretation of artificial neural networks based on multiobjective genetic algorithms and association rules mining


Accepted Manuscript

Title: Symbolic interpretation of artificial neural networks based on multiobjective genetic algorithms and association rules mining
Authors: Dounia Yedjour, Abdelkader Benyettou
PII: S1568-4946(18)30455-1
DOI: https://doi.org/10.1016/j.asoc.2018.08.007
Reference: ASOC 5035
To appear in: Applied Soft Computing
Received date: 31-10-2017
Revised date: 8-7-2018
Accepted date: 3-8-2018

Please cite this article as: Yedjour D, Benyettou A, Symbolic interpretation of artificial neural networks based on multiobjective genetic algorithms and association rules mining, Applied Soft Computing Journal (2018), https://doi.org/10.1016/j.asoc.2018.08.007 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Dounia YEDJOUR 1,*, Abdelkader BENYETTOU 1

1 Signal Image Parole Laboratory (SIMPA), Department of Computer Science, Université des Sciences et de la Technologie d'Oran Mohamed Boudiaf, USTO-MB, BP 1505, El M'naouer, 31000 Oran Algérie
{dounia.yedjour, abdelkader.benyettou}@univ-usto.dz

Highlights

- A multiobjective genetic algorithm based on the Pareto-optimal set is used to extract knowledge from ANN.
- The rule set should satisfy high fidelity, high accuracy and high comprehensibility.
- Invalid rules having low values of support, confidence and lift are removed.
- Redundant rules are also removed by using the sequential covering method.
- The lift allows extracting the interesting rare rules in the case of unbalanced datasets.

Abstract – Rule extraction from neural networks is an important task. In practice, decision makers often settle for less accurate but comprehensible models, typically decision trees, whose solutions are given in an easily interpretable graphical form. Black-box rule extraction techniques, which operate directly on the input-output relationship, are clearly superior to restricted open-box methods, which are normally tailored to a specific architecture. This is especially important since most data miners today use some kind of ensemble (instead of a single model) to maximize accuracy. Consequently, the ability to extract rules from any opaque model is a key demand for rule extraction techniques. This paper proposes a new multiobjective genetic method that extracts knowledge from a trained artificial neural network by using the association rules technique. The main aim of this hybridization is to extract the optimal rules from the neural network for further classification. The algorithm consists of two stages: a rule filtering phase, which eliminates misleading rules by taking into account the support, confidence and lift measures, and a rule set optimization phase, which finds the set of optimal rule sets by considering the fidelity, coverage and complexity measures. The algorithm is evaluated on five UCI datasets. The experimental results show that the proposal provides interesting rules. Accuracy and comprehensibility are clearly improved, which makes the approach a promising and trustworthy direction in the area of neural network rule extraction.

Keywords: Neural Networks, Class Association Rules, Genetic Algorithm, Multiobjective optimization

1. Introduction

Artificial Neural Networks (ANNs) are powerful machine learning techniques. They have been successfully used in a wide array of domains such as medicine, industry, science, finance and economics. Their popularity stems from their ability to learn from examples, their high generalization accuracy, their ability to solve both unsupervised and supervised problems, their ability to approximate nonlinear relationships between inputs and outputs, their pattern classification abilities [1] and their ability to provide solutions for noisy input data.


However, ANNs present two major drawbacks. The first is the over-fitting problem caused by a complex network structure: it occurs when a model begins to memorize the training data rather than learning to generalize trends from it [2]. Much work has been devoted to reducing over-fitting by pruning redundant and non-significant network connections and units [3]-[5]. The second is that ANNs store their knowledge in the form of synaptic weights, so it is difficult to understand their internal learning process. That is why they are called black boxes. In domains such as medical diagnosis [2], credit scoring [3], fault detection [6] and traffic crash frequency analysis [7], the justification of the decisions taken by the ANN is even more important than its performance. In practice, decision makers often settle for less accurate but comprehensible models, typically decision trees. To overcome this problem, several works have investigated the possibility of extracting symbolic rules from trained ANNs. Recently, novel applications of rule extraction from ANNs based on other machine learning tools have been developed [3,7,8]. To improve the performance of a rule extraction technique, an ANN can be combined with a genetic algorithm [8-9], a decision tree [10], association rules [11], an SVM [12], etc.

Generally, extracted rules are evaluated in terms of three measures: fidelity, accuracy and comprehensibility [13]. Fidelity is the ability of the extracted rule set to accurately mimic the model from which it was extracted. Accuracy refers to how well the extracted rule set makes predictions on previously unseen cases. Comprehensibility reflects how readable the extracted description of the model is for a human; the number of rules and the number of premises can be used as measures of it. The main aim of rule extraction techniques is to find a compromise between these measures [14]. In certain contradictory situations, an increase in accuracy coincides with a decrease in fidelity. This dilemma appears when the model is learned from a complex dataset, since a complex dataset leads to a complex model. A preliminary step to select interesting associations between patterns therefore remains necessary.

Inspired by association rule concepts and evolutionary algorithms, we developed a new technique for extracting rules from neural networks. The motivation of this study is to improve the accuracy, the fidelity and the comprehensibility of the ANN rules together, since most rule extraction algorithms tend to improve one measure to the detriment of another. We employ the basic concepts of association rules mining to filter out misleading rules. The filtering phase looks for reliable relations between the premises and the consequent by calculating the support, the confidence and the lift. As a result, fewer ANN rules with fewer conditions (premises) are generated, which considerably increases comprehensibility. Accuracy and fidelity are also significantly improved. Our algorithm is the first to use the lift measure during the rule extraction process. The lift allows rare rules to be extracted from unbalanced databases such as Glass and Pima. Our algorithm can easily be executed over any dataset, whatever its size, number of classes and number of attributes.

The structure of the paper is as follows: Section 2 describes the general rule extraction methodology in more detail. Section 3 describes our new ANN rule extraction technique. Simulation results and comparisons with other ANN rule extraction methods are analysed and discussed in Section 4. Finally, Section 5 concludes this paper.

2. Related work

2.1. Rules Extraction from Neural Networks

There are two main approaches to extracting rules from trained neural networks: local and global. Local methods first look for the weight combinations that activate a hidden or output neuron; these combinations are then used to generate the symbolic rules [15],[16]. These methods quickly become inefficient when the neural network becomes large. Global methods, on the other hand, are based on the analysis of the relationship between the inputs and outputs of the ANN [17, 18]. In this case, the neural network is considered as a black box. Black-box rule extraction techniques, which operate directly on the input-output relationship, are clearly superior to restricted open-box methods, which are normally tailored to a specific architecture. Consequently, the ability to extract rules from any opaque model is a key demand for rule extraction techniques. A survey of several rule extraction methods can be found in [19-20].

One of the first rule extraction methods for neural networks was introduced by Saito and Nakano [21]. The authors treated the network as a black box: rules were generated by observing the effect of changes in the inputs on the output of the (feedforward) network. To avoid combinatorial explosion, they restricted the number of antecedents in a rule. However, even with this limitation, the number of rules extracted for a relatively simple problem domain was exceedingly large. Gallant [22] developed a global rule extraction method that operates by propagating activation intervals through the network. Like the method of Saito and Nakano, it limits the number of antecedents of each rule. Towell and Shavlik [15] showed how to use ANNs for rule refinement. They developed two local rule extraction algorithms known as SUBSET and MofN. SUBSET is based on the analysis of the weights that make a specific neuron active; the size of its rule space grows exponentially with the input dimensionality. MofN searches not for conjunctive rules like SUBSET, but for rules of the form "If (N of the following M antecedents are true) then (the concept designated by the unit is true)". They indicate that the rule sets derived by MofN are more concise and comprehensible than those of SUBSET, with an accuracy approximately equal to that of the networks from which they are extracted. Thrun [17] described the validity interval analysis (VIA) approach. Like Gallant's method, VIA tests rules by propagating activation intervals through a network; activation ranges are used to prove rule correctness. Discrete and real-valued features are not handled by this method.

Taha and Ghosh [16] proposed three different techniques to extract rules from trained neural networks: BIO-RE, Partial-RE and Full-RE. They demonstrated experimentally that Full-RE is a universal extractor: it can extract accurate rules with certainty factors from networks trained with continuous, nominal and binary inputs. Markowska-Kaczmar [18] presented two ANN rule extraction techniques based on evolutionary algorithms, namely REX and GEX. REX uses propositional fuzzy rules while GEX uses crisp rules. She presented the coding scheme and the evaluation process of an individual for both approaches. In [3], the ANN is trained with augmented discretized continuous attributes. The authors first improve the accuracy of the ANN by pruning redundant network units or connections. Classification rules are then extracted from the pruned network by analyzing the connections and the neuron activations. Junqué and Martens [20] proposed a rule extraction method based on active learning. By adding new artificial data points and their predictions, the method behaves robustly in the presence of noise in the data. Zeng et al. [7] proposed an unsupervised ANN to approximate the relationship between crash frequency and risk factors. The ANN is first pruned to overcome the over-fitting problem. A rule extraction method is then applied in order to justify the decisions reached by the model.
Tran and Garcez [23] proposed an algorithm that can insert knowledge into, and extract knowledge from, ANNs trained by deep learning. The authors used logical rules, which they called confidence rules, to represent the quantitative reasoning in deep networks. Etemadi et al. [8] developed two learning methods (a neural network, and rule extraction from the ANN by genetic algorithm) to forecast Earnings Per Share (EPS), which can help financial managers and investors make good decisions. The authors demonstrated that the second technique is significantly more accurate than the first. Shinde and Kulkarni [12] proposed the modified fuzzy min-max neural network classifier (MFMMN), which can process both continuous and discrete attributes. After pruning, MFMMN extracts rules, measured in terms of fidelity, accuracy and comprehensibility, to explain its classification decisions. Augasta and Kathirvalavakumar [24] proposed a rule extraction method based on reverse engineering called RxREN. The algorithm starts by pruning the insignificant input neurons from the trained neural network and then constructs the classification rules from the significant input neurons only. In [25], the authors proposed a rule extraction algorithm for multilayer perceptrons that consists essentially of two steps. First, a clustering genetic algorithm is applied to find clusters of hidden unit activation values. Then, classification rules describing these clusters in relation to the inputs are generated. The proposed algorithm combines both local and global approaches. Zilke et al. [26] developed a local rule extraction algorithm for deep neural networks called DeepRED (Deep neural network Rule Extraction via Decision tree induction). It builds decision trees between any two layers; the trees are then transformed into rule sets. The authors chose the pruning technique of RxREN [24] to eliminate irrelevant inputs.
In [27], the authors presented a local algorithm to extract decision trees from multilayer feedforward neural networks. The algorithm consists of three phases: neural network pruning, activation value discretization (clustering), and finally decision tree building. Fu and Wang [28] proposed a novel method to extract rules from the radial basis function (RBF) neural network classifier. The authors first delete irrelevant weights connecting hidden units to output units. To obtain high rule accuracy, a genetic algorithm is then used to adjust the interval of each premise of each rule. In [29], the authors proposed a multiobjective evolutionary algorithm based on the Pareto approach to extract rules from neural networks that perform approximation tasks. The algorithm takes into account different types of attributes (enumerative, binary, continuous and discrete).

2.2. Using GA to extract rules from ANN

An ANN can be combined with a Genetic Algorithm (GA) to improve the performance of the rule extraction technique. Genetic algorithms work with a population of candidate solutions; a fitness function is used to evaluate the quality of an individual. Individuals with the highest scores are selected to create a new population by applying the crossover and mutation operations. This process is repeated until a termination criterion is met. Several works have used genetic algorithms to optimize the topology and the interconnection weights of the neural network [30-31] in order to improve its accuracy, since a more accurate network leads to more accurate classification rules [3]. In this case, each individual of the population represents a neural network topology. In the rule extraction problem, the genetic algorithm is used to extract rules from the ANN. There are two possibilities [18]: either each individual of the population encodes one rule, which is called the Michigan approach, or each individual encodes a set of rules, which is called the Pitt approach. In hybrid learning methods that combine neural networks with GA, the fitness function evaluates the quality of rules with respect to three criteria: comprehensibility, fidelity and accuracy, so the quality of the extracted rules is improved by increasing their comprehensibility, fidelity and accuracy [16]. However, these objectives are generally conflicting; the goal of a multiobjective rule extraction algorithm is to find a set of solutions that present the best compromise between them.

2.3. Multiobjective genetic algorithm

Many optimization strategies have been proposed to solve multiobjective problems, including the weighted sum method, the ε-constraint method, goal programming and some others. The weighted sum method [32] can be used to aggregate the objectives into a single objective; the aim is to converge to the same results. Here, the fitness function is given by:

$$\mathrm{Fitness} = w_1 f_1(x) + w_2 f_2(x) + \dots + w_k f_k(x)$$

where k ≥ 2 is the number of objective functions, x = (x1, …, xn) is the vector of decision variables, w1, w2, …, wk are the weight coefficients and f1, f2, …, fk are the objective functions.
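As a rough illustration of the weighted-sum aggregation, the following Python sketch scores two candidates under two different weight vectors. The function name, the candidates and all numeric values are hypothetical and chosen only for illustration; they do not come from the paper's experiments.

```python
def weighted_sum_fitness(objectives, weights):
    """Aggregate k objective values into one scalar: sum_i w_i * f_i(x)."""
    if len(objectives) != len(weights):
        raise ValueError("one weight per objective is required")
    return sum(w * f for w, f in zip(weights, objectives))


# Two hypothetical candidates scored on three objectives to be maximized.
# The numbers are made up for illustration only.
candidate_a = (0.95, 0.80, 0.40)
candidate_b = (0.85, 0.85, 0.70)

for weights in [(0.6, 0.3, 0.1), (0.4, 0.3, 0.3)]:
    score_a = weighted_sum_fitness(candidate_a, weights)
    score_b = weighted_sum_fitness(candidate_b, weights)
    best = "A" if score_a > score_b else "B"
    print(weights, round(score_a, 3), round(score_b, 3), "->", best)
```

With these made-up values, the preferred candidate changes when the weights change, which is the practical difficulty of picking weights by hand.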


In the case of the rule extraction problem, the goal of a multiobjective genetic algorithm is to find a vector x* that maximizes or minimizes (according to the context) the Fitness value. Here x represents a rule (in the Michigan approach) or a rule set (in the Pitt approach), f1 is the fidelity measure, f2 is the coverage measure, f3 is the comprehensibility measure, and so on. Unfortunately, it is difficult to choose an appropriate weight for each objective: small perturbations in the weights can lead to very different solutions [32]. Another approach is multiobjective optimization in the Pareto sense. Here, the fitness function is given by:

$$\mathrm{Fitness} = \{f_1(x), f_2(x), \dots, f_k(x)\}$$

Let x ∈ E (the set of solutions). In a minimization context, x dominates another solution x' ∈ E if:

$$\forall i \in [1..k]:\ f_i(x) \le f_i(x') \quad \text{and} \quad \exists j \in [1..k]:\ f_j(x) < f_j(x')$$
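The dominance test above translates almost directly into code. The sketch below is a minimal Python illustration, assuming every objective is minimized; the candidate vectors are hypothetical values written as (-fidelity, -coverage, complexity), and the quadratic-time front extraction is only meant for small examples.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective vectors (-fidelity, -coverage, complexity).
candidates = [
    (-0.95, -0.80, 6),
    (-0.90, -0.85, 4),
    (-0.90, -0.80, 6),  # dominated by the first and the second candidate
    (-0.85, -0.70, 2),
]
front = pareto_front(candidates)
print(front)
```

Only the third candidate is dropped: each of the others is better than the rest on at least one objective.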

In this case, there is no single optimal solution, but rather a set of alternative solutions, generally called the set of Pareto-optimal solutions, which forms the Pareto frontier. These solutions are optimal in the wider sense that no other solution in the search space is superior to them when all objectives are considered [33]. The final choice is left to the decision-maker [34].

2.4. Rules Extraction from dataset

Association Rule mining (AR) is another popular technique in the field of data mining. It was introduced by Agrawal et al. [35] and aims to extract interesting associations among attributes in the dataset [36-39]. Only valid rules are extracted, i.e., rules with support > minsup and confidence > minconf, where minsup is the minimum support threshold and minconf is the minimum confidence threshold.

Support of a rule: is defined as the number of patterns that contain both X and c divided by the total number of patterns in the dataset. The number of patterns that contain X ∪ {c} is called coverage (cover):

$$\mathrm{Sup}(X \rightarrow c) = \frac{\mathrm{cover}(X \rightarrow c)}{N} = \frac{\mathrm{cover}(r_i)}{N} \tag{1}$$

where N is the number of patterns in the dataset.

Confidence: is defined as the percentage of patterns that contain X ∪ {c} relative to the total number of examples that contain X:

$$\mathrm{Conf}(X \rightarrow c) = \frac{\mathrm{Sup}(X \rightarrow c)}{\mathrm{Sup}(X)} = \frac{\mathrm{cover}(X \rightarrow c)}{\mathrm{cover}(X)} \tag{2}$$

Unfortunately, a rule can have an excellent support and an excellent confidence without being interesting [40]. Hence the need for another measure to characterize a rule, named lift, which is given by [41]:

$$\mathrm{Lift}(X \rightarrow c) = \frac{\mathrm{Conf}(X \rightarrow c)}{\mathrm{Sup}(c)} \tag{3}$$

Here, Sup(c) is defined as the number of examples that include the class c divided by the total number of examples. Lift provides information about the change in probability of X in the presence of c. A lift value greater than 1 indicates that the examples containing the class c tend to contain the attributes X more often than examples that do not contain c, which makes the rule potentially useful for predicting the consequent in future datasets. A lift equal to 1 means that X and c are independent, and a lift lower than 1 means that X and c are negatively correlated. In hybrid learning methods that combine neural networks with AR, AR is generally used to reduce the dimension of the dataset and the neural network is used for intelligent classification [42]. In our approach, AR is used to filter rules before they are passed to the genetic algorithm.
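The three measures of Eqs (1) to (3) can be computed by simple counting. The sketch below is an illustrative Python version, not the authors' MATLAB code: the rule representation (a dict mapping a 0-based attribute index to its required binary value), the function names and the toy dataset are all assumptions made for the example.

```python
def covers(antecedent, features):
    """True if every attribute test of the antecedent holds for the example."""
    return all(features[i] == v for i, v in antecedent.items())


def rule_measures(antecedent, target, examples):
    """Return (support, confidence, lift) of the rule antecedent -> target."""
    n = len(examples)
    n_x = sum(1 for f, _ in examples if covers(antecedent, f))
    n_c = sum(1 for _, y in examples if y == target)
    n_xc = sum(1 for f, y in examples if y == target and covers(antecedent, f))
    support = n_xc / n                                # Eq. (1)
    confidence = n_xc / n_x if n_x else 0.0           # Eq. (2)
    lift = confidence / (n_c / n) if n_c else 0.0     # Eq. (3)
    return support, confidence, lift


# Toy dataset (made up): binary attributes (A1, A2) and a class label.
examples = [
    ((1, 0), "c1"), ((1, 0), "c1"), ((1, 1), "c1"), ((1, 1), "c2"),
    ((0, 1), "c2"), ((0, 1), "c2"), ((0, 0), "c2"), ((0, 0), "c1"),
]

# Rule "If A1 then c1": 3 of the 4 examples with A1 = 1 belong to c1.
print(rule_measures({0: 1}, "c1", examples))
```

On this toy data the rule has support 3/8, confidence 3/4 and lift 1.5, so it would count as positively correlated with the class.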

3. Proposed Algorithm

Before explaining the details of the proposed method, Table 1 summarizes the notations used in the following sections.

Table 1 Notation table

| Notations | Descriptions |
|---|---|
| Sup(ri) | the support value of the rule (ri) calculated by Eq (1) |
| Conf(ri) | the confidence value of (ri) calculated by Eq (2) |
| Lift(ri) | the lift value of (ri) calculated by Eq (3) |
| Minsup | the support threshold |
| Minconf | the confidence threshold |
| cover (Rule) | the number of examples correctly classified by Rule |
| Np | the number of patterns of a given class |

The proposed algorithm is a multiobjective evolutionary algorithm. It extracts binary rules in IF-THEN form from a neural network trained with binary inputs. As shown in Fig. 1, our algorithm consists of three major components: a trained neural component, a genetic component and a rule filtering component. Generally, a dataset of binary inputs/outputs can be written as a logical sum of AND operations:

$$\mathrm{Class}_K = \sum_{j=1}^{jclass} \mathrm{example}_j \quad \text{where} \quad \mathrm{example}_j = \prod_{i=1}^{imax} x_{ji} \tag{4}$$

Here jclass is the number of examples in the dataset satisfying ClassK, j is the example index, xji is the i-th attribute of the j-th example and imax is the number of attributes. Each row of the dataset (example j) can be viewed as a rule (Rj), where the attributes xji represent the premises and the desired output represents the conclusion of the rule [9].

3.1. ANN component:

We chose to work with an MLP (Multi-Layer Perceptron) trained by the backpropagation algorithm. An MLP is a neural network that contains an input layer, one or more hidden layers and an output layer (see Fig. 2). Each unit has an internal state called the activation state (ai). The units propagate their activation states to the other units through weighted arcs called synaptic weights (wij). In supervised learning, the output of the network is compared with the desired output. The calculated error is back-propagated to the previous layers to adjust the synaptic weight values. Learning stops when the error reaches a certain threshold or after a given number of epochs. The aim is to induce or generalize knowledge from a training dataset. At the end of the learning phase, we obtain the optimal weight values. These values represent the knowledge of the neural network, but they are difficult for a user to interpret. The aim of the genetic module is to represent this knowledge in an intelligible form.

3.2. Multi-objective Genetic algorithm:

In the rule extraction problem, the aim of the multi-objective genetic algorithm is to search the space E of rule sets for those that best represent the knowledge of the neural network according to three criteria: fidelity, coverage and comprehensibility. This requires maximizing fidelity and coverage while minimizing complexity. Mathematically, the problem can be formulated as follows:

$$\min \big(-\mathrm{fidelity}(x),\ -\mathrm{coverage}(x),\ \mathrm{complexity}(x)\big) \tag{5}$$

$$x = \{r_1, r_2, \dots, r_n\} \in E, \quad \forall r \in x:\ \mathrm{sup}(r) \ge minsup,\ \mathrm{conf}(r) \ge minconf,\ \mathrm{lift}(r) > 1$$

Here x is the vector of decision variables (a rule set) and E is the decision space, which represents the set of potential solutions of the optimization problem. Fidelity, coverage and complexity are the objective functions, and sup(r) ≥ minsup, conf(r) ≥ minconf and lift(r) > 1 are the constraints that each rule in x must satisfy. The general idea of the method is shown in Algorithm 1.

Fig. 1. The proposed approach


Fig. 2. MLP neural network with one hidden layer

Algorithm 1
For every individual in the population do
    Decode individual to the set of rules
    For every rule r do
        Calculate Sup (r)
        Calculate Conf (r)
        Calculate Lift (r)
        If (Sup (r) < minsup) or (Conf (r) < minconf) or (Lift (r) <= 1) Then
            Remove rule from the rule set
        End If
    End For
    Use the sequential covering approach to form the new rule set
    Calculate the fidelity part of the fitness function
    Calculate the coverage part of the fitness function
    Calculate the complexity part of the fitness function
End For
Generate Pareto optimal solutions based on the fitness evaluation

Three objective functions are used in the rule set optimization: fidelity, coverage (accuracy) and complexity:
- Fidelity: is defined as the proportion of the number of examples that have been correctly classified by the neural network among all the examples that have been correctly classified by the rule set. It defines the ability of the extracted rules to mimic the behaviour of the network from which they were extracted [24].
- Complexity: is defined by the number of rules in the rule set. The lower the complexity, the more comprehensible the rules in the final set. It should be noted that no maximum rule length is imposed in our experiments.
- Coverage of a rule set (cover_set): is defined as the ratio of examples which are covered by the rule set to the total number of examples:

$$Cover\_set = \frac{\sum_{r_i \in rule\_set} Cover(r_i)}{N_{cj}} \tag{6}$$

where Ncj is the number of patterns in the class j.
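To make the three objectives concrete, here is a small Python sketch that returns the vector minimized in Eq (5). It is only one possible reading of the definitions above: fidelity follows the literal wording (share of examples correctly classified by the rule set that the network also classifies correctly), and coverage counts each target-class example once, which assumes that Cover(ri) in Eq (6) is measured after sequential covering. The rule representation, function names and data are hypothetical.

```python
def covers(antecedent, features):
    """True if every attribute test of the antecedent holds for the example."""
    return all(features[i] == v for i, v in antecedent.items())


def objective_vector(rule_set, examples, network_labels, target):
    """Return (-fidelity, -coverage, complexity), all to be minimized."""
    # Examples of the target class matched by at least one rule.
    hit = [y == target and any(covers(r, f) for r in rule_set)
           for f, y in examples]
    n_hit = sum(hit)
    # Among those, the ones the network also classifies correctly.
    n_both = sum(1 for h, net, (_, y) in zip(hit, network_labels, examples)
                 if h and net == y)
    fidelity = n_both / n_hit if n_hit else 0.0
    n_class = sum(1 for _, y in examples if y == target)
    coverage = n_hit / n_class if n_class else 0.0
    complexity = len(rule_set)  # number of rules
    return (-fidelity, -coverage, complexity)


# Toy data (made up): binary attributes (A1, A2), true labels, network outputs.
examples = [
    ((1, 0), "c1"), ((1, 0), "c1"), ((1, 1), "c1"), ((1, 1), "c2"),
    ((0, 1), "c2"), ((0, 1), "c2"), ((0, 0), "c2"), ((0, 0), "c1"),
]
network_labels = ["c1", "c1", "c2", "c2", "c2", "c2", "c2", "c2"]

scores = objective_vector([{0: 1}], examples, network_labels, "c1")
print(scores)
```

Negating fidelity and coverage lets a minimizer such as a Pareto-based genetic algorithm treat all three objectives uniformly.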

The genetic module consists of the following cycle: creation of initial rules, filtering of rules, selection of the best rules and creation of new rules by applying genetic operators.

3.2.1. Creation of initial rules:

The algorithm starts by generating an initial population on the basis of the truth table, which contains all possible combinations of Boolean inputs; some of the inputs are then randomly replaced by the value “-1” (pruning of some of the attributes). Each individual represents a potential solution and consists of a vector of rules. This is called the Pittsburgh approach (see Fig. 3). Each rule is coded as a string of integer values between -1 and 1, where each value represents the value of an attribute: “-1” means that the attribute is not involved in the rule, “0” means that the attribute is written “Not (Ai)” in the generated rule and “1” means that the attribute is written “Ai” in the generated rule. Only the premises are represented in the chromosome because all the individuals of the population are associated with the same class. For example, the rule r2 in Fig. 3 is written as: “If Not (A2) and A4 then Cj”.
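The encoding can be illustrated with a short decoding routine. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, the gene vector used for r2 assumes a problem with four attributes (Fig. 3 is not reproduced here), and the "If True" fallback for a rule with no premises is an arbitrary choice, not something the paper specifies.

```python
def decode_rule(genes, target="Cj"):
    """Turn a list of genes in {-1, 0, 1} into an IF-THEN rule string."""
    premises = []
    for index, gene in enumerate(genes, start=1):
        if gene == 1:
            premises.append(f"A{index}")
        elif gene == 0:
            premises.append(f"Not (A{index})")
        elif gene != -1:
            raise ValueError(f"gene {gene!r} is not in {{-1, 0, 1}}")
    if not premises:
        # Assumption: a rule with every attribute pruned is always true.
        return f"If True then {target}"
    return "If " + " and ".join(premises) + f" then {target}"


def decode_individual(chromosome, n_attributes, target="Cj"):
    """Split a flat chromosome into rules of n_attributes genes each."""
    return [decode_rule(chromosome[i:i + n_attributes], target)
            for i in range(0, len(chromosome), n_attributes)]


# Assumed gene vector for rule r2 with four attributes A1..A4.
print(decode_rule([-1, 0, -1, 1]))  # If Not (A2) and A4 then Cj
```

A Pittsburgh individual is then just several such gene vectors laid end to end, which `decode_individual` splits back into rules.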

Fig. 3. The form of the chromosome for a class j

3.2.2. Filtering of rules:

The evaluation of one individual needs a cycle composed of two steps (see Fig. 4):
1. Pruning uninteresting rules: this phase eliminates rules whose support, confidence or lift value is smaller than its threshold.
2. Pruning redundant rules: this phase eliminates rules that cover the same examples. Rules are first sorted according to support; then the sequential covering method is applied, i.e., the patterns covered by the first rule are removed from the set of patterns, the second rule is chosen for the remaining patterns, and so on. At the end, we obtain a reduced set of rules that covers a maximum number of examples.

Fig. 4. Filtering Phase (applied for each individual)

3.2.3. Selection and creation of new rules:

After the filtering phase, the resulting rule sets (new individuals) produced by steps 1 and 2 are evaluated with the fitness function according to three criteria: fidelity, comprehensibility and coverage. The best individuals are then selected by a tournament selection operator. Finally, a new generation of solutions is produced by applying crossover and mutation operators. We used the two-point crossover technique shown in Eq (7), where the two points p1 and p2 are selected randomly. The mutation operator takes one parent from the population and randomly changes two genes to create a child.

$$child(i) = \begin{cases} parent_1(i) & \text{if } i \in [p_1, p_2] \\ parent_2(i) & \text{if } i \notin [p_1, p_2] \end{cases} \tag{7}$$
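The two operators can be sketched in a few lines of Python. This is an illustration rather than the operators actually used by gamultiobj: the function names and the `rng` parameter are hypothetical, and the choice that the child copies parent1 inside [p1, p2] and parent2 outside is an assumption (the complementary child is equally valid).

```python
import random


def two_point_crossover(parent1, parent2, rng=random):
    """Child copies parent1 inside [p1, p2] and parent2 elsewhere (assumed)."""
    p1, p2 = sorted(rng.sample(range(len(parent1)), 2))
    child = [parent1[i] if p1 <= i <= p2 else parent2[i]
             for i in range(len(parent1))]
    return child, (p1, p2)


def mutate(parent, rng=random, alleles=(-1, 0, 1)):
    """Copy the parent and give two randomly chosen genes a different value."""
    child = list(parent)
    for i in rng.sample(range(len(child)), 2):
        child[i] = rng.choice([a for a in alleles if a != child[i]])
    return child


rng = random.Random(0)  # seeded only to make the demo repeatable
child, points = two_point_crossover([1] * 8, [0] * 8, rng)
print(points, child)
print(mutate([0] * 8, rng))
```

Passing a seeded `random.Random` instance keeps runs reproducible, which helps when comparing Pareto fronts across experiments.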

We used the gamultiobj function available in the MATLAB optimization toolbox to solve Eq (5). This function uses a genetic algorithm to find the Pareto optimal solutions. It uses a controlled elitist genetic algorithm, which favours individuals with better fitness values. The diversity of the population is maintained, in order to ensure convergence to an optimal Pareto front, by using two options, 'ParetoFraction' and 'DistanceFcn' [43]. The algorithm used in gamultiobj is described in [44].

3.3. Algorithm in detail

For each individual of the population, the chromosome is decoded into a vector of rules, and the support, confidence and lift values of each rule are computed. Rules whose support, confidence and lift values all exceed their thresholds are selected and sorted by support in descending order (we prefer attributes that occur frequently in the dataset to be used as often as possible). The resulting set only comprises the best rules (see Algorithm 2).

Algorithm 2 Rule_evaluation (Individual, Examples, minsup, minconf)
Decode individual into the rule set
Best_Rules = Ø
For every rule ri in the rule set do
    Calculate Sup(ri), the support value of (ri);
    Calculate Conf(ri), the confidence value of (ri);
    Calculate Lift(ri), the lift value of (ri);
    If Sup(ri) > minsup and Conf(ri) > minconf and Lift(ri) > 1 Then
        Best_Rules = Best_Rules U {ri}
    End if
End For
Sort Best_Rules in decreasing order of support
Return Best_Rules
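A compact Python rendering of this filtering step might look as follows. It is a sketch under stated assumptions, not the authors' MATLAB implementation: rules are assumed to be already decoded into dicts mapping a 0-based attribute index to its required binary value, and the dataset, thresholds and function names are made up for the example.

```python
def covers(antecedent, features):
    """True if every attribute test of the antecedent holds for the example."""
    return all(features[i] == v for i, v in antecedent.items())


def rule_measures(antecedent, target, examples):
    """Return (support, confidence, lift) following Eqs (1) to (3)."""
    n = len(examples)
    n_x = sum(1 for f, _ in examples if covers(antecedent, f))
    n_c = sum(1 for _, y in examples if y == target)
    n_xc = sum(1 for f, y in examples if y == target and covers(antecedent, f))
    confidence = n_xc / n_x if n_x else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return n_xc / n, confidence, lift


def rule_evaluation(rules, target, examples, minsup, minconf):
    """Keep rules passing all three tests, sorted by decreasing support."""
    scored = []
    for rule in rules:
        sup, conf, lift = rule_measures(rule, target, examples)
        if sup > minsup and conf > minconf and lift > 1:
            scored.append((sup, rule))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rule for _, rule in scored]


examples = [
    ((1, 0), "c1"), ((1, 0), "c1"), ((1, 1), "c1"), ((1, 1), "c2"),
    ((0, 1), "c2"), ((0, 1), "c2"), ((0, 0), "c2"), ((0, 0), "c1"),
]
rules = [{0: 1, 1: 0}, {1: 1}, {0: 1}]  # hypothetical decoded individual
best_rules = rule_evaluation(rules, "c1", examples, minsup=0.2, minconf=0.7)
print(best_rules)
```

Here the rule on A2 alone is discarded (lift 0.5), and the two surviving rules are reordered so that the one with higher support comes first.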

However, exceeding these thresholds does not guarantee that all the rules in Best_Rules are useful. The next phase therefore eliminates rules that cover the same examples. To do this, we start from an empty rule set. The rules in Best_Rules are considered one by one, and each time the cover value of the rule is calculated. If the rule correctly classifies at least one of the remaining examples that are not yet correctly classified by the rule set, then the rule is added to the rule set. We stop adding rules when one of the following conditions occurs: all the examples of the specified class are covered by the rule set, or all the rules of Best_Rules have been processed (see Algorithm 3). We then calculate the fidelity, the coverage and the complexity of the obtained rule set.

Algorithm 3 RuleSet_evaluation (Best_Rules, Examples)
RuleSet = Ø; Coverset = 0; i = 1;
Repeat
    Rule = Best_Rules(i);
    Calculate cover (Rule); /* the number of examples correctly classified by Rule */
    If cover (Rule) <> 0 Then
        RuleSet = RuleSet U {Rule};
        Coverset = Coverset + cover (Rule);
        Examples = Examples - {Examples correctly classified by Rule};
    End if
    i = i + 1;
Until (Termination criteria are met)
Coverage = Coverset / NP
Calculate the fidelity of RuleSet
Calculate the complexity of RuleSet
Scores = [fidelity, Coverage, complexity];
Return (Scores)

The multiobjective genetic algorithm function gamultiobj in MATLAB is then invoked to refine and optimize the candidate rule sets produced by the previous algorithm and to generate the Pareto optimal solutions by maximizing fidelity and coverage and minimizing complexity.
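The sequential covering loop of Algorithm 3 can be sketched in Python as shown below. This is an illustrative reading rather than the authors' code: it returns the retained rules and the coverage only (fidelity additionally needs the network outputs), rules are dicts mapping a 0-based attribute index to a binary value, and the input list and dataset are hypothetical.

```python
def covers(antecedent, features):
    """True if every attribute test of the antecedent holds for the example."""
    return all(features[i] == v for i, v in antecedent.items())


def ruleset_evaluation(best_rules, target, examples):
    """Keep a rule only if it correctly classifies at least one target-class
    example that earlier rules have not covered yet."""
    n_class = sum(1 for _, y in examples if y == target)
    remaining = list(examples)
    rule_set, cover_set = [], 0
    for rule in best_rules:
        newly = [ex for ex in remaining
                 if ex[1] == target and covers(rule, ex[0])]
        if newly:
            rule_set.append(rule)
            cover_set += len(newly)
            remaining = [ex for ex in remaining
                         if not (ex[1] == target and covers(rule, ex[0]))]
        if cover_set == n_class:  # every example of the class is covered
            break
    coverage = cover_set / n_class if n_class else 0.0
    return rule_set, coverage


examples = [
    ((1, 0), "c1"), ((1, 0), "c1"), ((1, 1), "c1"), ((1, 1), "c2"),
    ((0, 1), "c2"), ((0, 1), "c2"), ((0, 0), "c2"), ((0, 0), "c1"),
]
# Hypothetical rule list, assumed to be already sorted by support.
best_rules = [{0: 1}, {0: 1, 1: 0}, {0: 0, 1: 0}]
rule_set, coverage = ruleset_evaluation(best_rules, "c1", examples)
print(rule_set, coverage)
```

The second rule is dropped as redundant because every example it classifies correctly is already covered by the first one.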

4. Experimental Results and Discussion

This section evaluates the performance of our approach on five benchmark classification problems from the UCI Machine Learning Repository [45]. The algorithm must be run once for each possible value of the target variable. It can be applied to datasets of any size, number of classes and number of attributes. The datasets used are summarized in Table 2. The proposed algorithm was developed in the MATLAB environment. The neural networks were trained with the NPRTOOL function of the Neural Network Toolbox, using the default training parameters.

Table 3 reports the classification rates on the training and test data, where NNH is the number of neurons in the hidden layer. These results were obtained using 10-fold cross-validation: the dataset is divided into 10 parts, and in each run nine parts are used for training and the remaining part for testing. After the 10 runs, the average and standard deviation are computed. All datasets were processed by a multilayer feedforward neural network with one hidden layer, trained with binary inputs using the backpropagation method. If the original inputs are not binary, they are binarized using Eq. (8) [16]:

$$y_i = \begin{cases} 1 & \text{if } a_i \geq u_i \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

where $a_i$ is the value of the original input $A_i$, $u_i$ is the average value of $A_i$, and $y_i$ is the corresponding binarized input value of $a_i$.
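As an illustration of Eq. (8), the following Python sketch binarizes each attribute against its own mean. The function name `binarize_columns` and the sample values are hypothetical, and the sketch assumes that the average is computed over the rows passed in.

```python
from statistics import mean


def binarize_columns(rows):
    """Apply Eq. (8) column by column: 1 if a_i >= u_i, else 0.

    rows is a list of equally long numeric lists (one list per pattern).
    Returns the binarized rows and the per-attribute thresholds u_i.
    """
    columns = list(zip(*rows))
    thresholds = [mean(col) for col in columns]
    binary = [[1 if a >= u else 0 for a, u in zip(row, thresholds)]
              for row in rows]
    return binary, thresholds


# Hypothetical values: three patterns with two attributes each.
patterns = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
binary, thresholds = binarize_columns(patterns)
print(thresholds)  # attribute means used as u_i
print(binary)
```

A value equal to the mean is mapped to 1, as required by the "greater than or equal" condition in Eq. (8).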

Table 2 Datasets used in our computational experiments

Dataset     Patterns per class (total)   #A   #C
Iris        50-50-50 (150)               4    3
B.Cancer    453-237 (690)                9    2
Austra      307-383 (690)                14   2
Glass       51-163 (214)                 9    2
Pima        500-268 (768)                8    2

(#A: number of attributes, #C: number of classes)

Table 3 Test and training classification rates

Database    NNH   Train (%)       Test (%)
Iris        6     75.33±0.0093    75.33±0.0834
B.Cancer    3     99.15±0.0011    97.10±0.0118
Austra      3     96.12±0.0034    86.51±0.0379
Pima        8     76.50±0.0076    72.01±0.0474
Glass       2     98.70±0.0037    94.47±0.0545

4.1. Performance measures

For each class $c_j$, we denote by:
- meanFid($c_j$): the average of the obtained fidelity values.
- meanCov($c_j$): the average of the coverage values.
- meanRules($c_j$): the average number of rules.
- NP($c_j$): the number of patterns of class j.
- CN($c_j$) and CD($c_j$): the average number of examples correctly classified by the neural network and by the rule set, respectively.

$$CN(c_j) = meanFid(c_j) \times NP(c_j) \tag{9}$$

$$CD(c_j) = meanCov(c_j) \times NP(c_j) \tag{10}$$

We denote by:
- FID: the total average value of the fidelity over the whole dataset.
- COV: the total average value of the coverage over all classes.
- #RULES: the total average number of rules over all classes.

$$FID = \frac{\sum_{j=1}^{number\_of\_classes} CN(c_j)}{N} \tag{11}$$

$$COV = \frac{\sum_{j=1}^{number\_of\_classes} CD(c_j)}{N} \tag{12}$$

$$\#RULES = \sum_{j=1}^{number\_of\_classes} meanRules(c_j) \tag{13}$$

where N is the number of all patterns of the given dataset:

$$N = \sum_{j=1}^{number\_of\_classes} NP(c_j) \tag{14}$$

4.2. The settings of computational experiments

Several tests were run to select the parameters used in our experiments; the final configuration is the set-up that yielded the best results overall. For the rule filtering phase, we set the minimum confidence and the minimum support to 0.5 and 0.1, respectively, and required the lift to be greater than one. Our algorithm uses the gamultiobj MATLAB function to obtain a Pareto front for three objective functions (fidelity, coverage and complexity). The settings of the computational experiments are as follows: Population size: 30, Number of variables: 15, Generations: 2000, StallGenLimit: 250, Mutation probability: 0.2, Crossover probability: 0.8, ParetoFraction: 0.1 and Selection function: tournament. Default settings were used for the other parameters.

4.3. Results and discussion

Fig. 5 shows the convergence histories of the gamultiobj function for the malignant class as the number of generations (Gen) varies. Note that the convergence time depends on the initial parameters, such as the initial population and the number of generations. Although it is important to find rules with high fidelity and high accuracy, it is more important to obtain reliable and interesting rules whose antecedents and consequents are well correlated. Therefore, each candidate rule must satisfy the support, lift and confidence thresholds. As Fig. 5 shows, the first potential solution is obtained before generation 5, with a fidelity of 100% and a coverage of 84%. At each generation, the solutions are improved by applying genetic operators. The algorithm requires 65 generations to find three Pareto-optimal solutions, which represent the optimum trade-off among the three objectives. Table 4 shows the Pareto-optimal solutions obtained in terms of coverage, fidelity and complexity. Table 5 shows the corresponding rules (after decoding the chromosome) and their quality in terms of support, confidence and lift. The rule {If Uniformity of Cell Shape ≥ 3.2 then malignant} is the most interesting because it has the highest lift (2.7051), meaning that the premise and the class are well correlated. Therefore, rules including (Uniformity of Cell Shape ≥ 3.2) in the premise imply an increase in the occurrence of the malignant class. The same holds for the rule (If Bland Chromatin ≥ 3.4 then malignant), which has a lift of 2.6618.
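To make the notion of Pareto-optimality over the three objectives concrete, here is a small Python sketch of a dominance check (maximize fidelity and coverage, minimize the number of rules). It is only an illustration and not how gamultiobj works internally; the names `dominates` and `pareto_front` are hypothetical. The first three tuples are the solutions of Table 4, while the fourth is an invented candidate added to show a dominated point.

```python
def dominates(p, q):
    """True if p dominates q; tuples are (fidelity, coverage, n_rules)."""
    no_worse = p[0] >= q[0] and p[1] >= q[1] and p[2] <= q[2]
    better = p[0] > q[0] or p[1] > q[1] or p[2] < q[2]
    return no_worse and better


def pareto_front(solutions):
    """Return the solutions that no other solution dominates."""
    return [p for p in solutions
            if not any(dominates(q, p) for q in solutions)]


solutions = [
    (100.0, 87.85, 1),   # Solution 1 (Table 4)
    (100.0, 99.06, 2),   # Solution 2 (Table 4)
    (100.0, 100.0, 3),   # Solution 3 (Table 4)
    (100.0, 84.0, 2),    # hypothetical candidate, dominated by Solution 1
]
front = pareto_front(solutions)
print(front)
```

None of the three solutions of Table 4 dominates another: each gain in coverage is paid for with one more rule, which is why all three remain on the front.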


Fig. 6 shows the average performance of the rule sets obtained for the Wisconsin Breast Cancer dataset using 10-fold cross-validation, with minfidelity and mincoverage set to 0.98. As shown, the algorithm extracts 13 rule sets covering the benign and malignant cases (five rule sets for benign and eight for malignant). Each rule set represents one solution to the problem. For each class, the algorithm calculates the average fidelity (meanFid), the average coverage (meanCov) and the average number of rules (meanRules), as defined in Section 4.1. Then, CN and CD are calculated using Eq. (9) and Eq. (10). Finally, the values of FID, COV and #RULES are calculated using Eq. (11), Eq. (12) and Eq. (13). Each rule in the rule sets has a support greater than 0.1, a confidence greater than 0.5 and a lift greater than one.
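The aggregation from per-class averages to FID, COV and #RULES can be sketched in Python as below. The function name `aggregate`, the dictionary keys and the numeric inputs are hypothetical and chosen only to make the arithmetic easy to follow; fidelity and coverage are expressed as fractions.

```python
def aggregate(per_class):
    """Combine per-class averages into (FID, COV, #RULES).

    per_class: list of dicts with keys mean_fid, mean_cov (fractions),
    mean_rules and n_patterns, one dict per class.
    """
    n_total = sum(c["n_patterns"] for c in per_class)            # Eq. (14)
    cn = [c["mean_fid"] * c["n_patterns"] for c in per_class]    # Eq. (9)
    cd = [c["mean_cov"] * c["n_patterns"] for c in per_class]    # Eq. (10)
    fid = sum(cn) / n_total                                      # Eq. (11)
    cov = sum(cd) / n_total                                      # Eq. (12)
    n_rules = sum(c["mean_rules"] for c in per_class)            # Eq. (13)
    return fid, cov, n_rules


# Hypothetical two-class example (values are not taken from the paper).
per_class = [
    {"mean_fid": 1.0, "mean_cov": 0.8, "mean_rules": 2.0, "n_patterns": 100},
    {"mean_fid": 0.9, "mean_cov": 1.0, "mean_rules": 3.0, "n_patterns": 300},
]
fid, cov, n_rules = aggregate(per_class)
print(round(fid, 4), round(cov, 4), n_rules)
```

Because CN and CD are weighted by the class size, FID and COV are pattern-weighted averages, so a large class influences them more than a small one.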

Fig. 5. Convergence histories of the gamultiobj function

Table 4 Obtained Pareto-optimal solutions in terms of coverage, fidelity and complexity for class malignant

Solution     Fidelity (%)   Coverage (%)   #Rules
Solution 1   100            87.85          1
Solution 2   100            99.06          2
Solution 3   100            100            3

Table 5 Rule qualities in terms of support, confidence and lift for class malignant

Pareto-optimal solution   Rule set                                              Support   Confidence   Lift
Solution 1                If (Uniformity of Cell Shape ≥ 3.2) Then malignant    0.3023    0.9307       2.7051
Solution 2                If (Clump Thickness ≥ 4.4) Then malignant             0.2990    0.6764       1.9659
                          If (Bland Chromatin ≥ 3.4) Then malignant             0.2797    0.9158       2.6618
Solution 3                If (Uniformity of Cell Shape ≥ 3.2) Then malignant    0.3023    0.9307       2.7051
                          If (Clump Thickness ≥ 4.4) Then malignant             0.2990    0.6764       1.9659
                          If (Bland Chromatin ≥ 3.4) Then malignant             0.2797    0.9158       2.6618

As shown in Fig. 6, our proposal finds three rules that correctly classify all examples of the malignant class, with the highest fidelity of 100%. Fig. 7 shows the results obtained by our approach in terms of average fidelity, average coverage and average number of rules on the Iris, Wisconsin Breast Cancer, Austra, Pima and Glass datasets, with minfidelity and mincoverage varied in the interval [0.6, 1] and rule limit = 10 (the maximum number of rules). The rule sets are sorted in descending order of their FID values.

Fig. 6. Performance measures used to evaluate our approach by setting the parameters minfidelity and mincoverage to 0.98.

Fig. 7. The results obtained by our approach using different datasets: (a) Iris, (b) Wisconsin Breast Cancer, (c) Austra, (d) Pima, (e) Glass. Each panel plots FID, COV and #RULES against the rule set index.

As shown in Fig. 7 (a, d), high fidelity and high accuracy cannot be obtained simultaneously for the Iris and Pima datasets. It seems that the thresholding used in Eq. (8) to binarize the input patterns is responsible for the relatively poor performance of the corresponding neural networks (see Table 3) and, subsequently, of the extracted rules. It is also worth mentioning that the networks with a large number of hidden nodes (6 for Iris and 8 for Pima) generalize poorly. We call this situation the fidelity-accuracy dilemma. The final choice is left to the user, who decides which measure is more important by modifying the minfidelity and mincoverage parameters.


As can be seen in Fig. 7 (b, c, e), our proposal finds solutions with a well-balanced trade-off between fidelity, coverage and complexity (number of rules). Focusing on the average coverage (COV) and the average number of rules (#RULES), the two measures evolve in almost the same way (see Fig. 7). This means that a solution (a rule set) with high accuracy requires a high number of rules. Although the complexity in this region is relatively high, it remains acceptable.


Our proposal finds the rule (if A2 < 3.12 and A6 < 3.46 then benign) with a fidelity of 100%, satisfying 429 patterns of the benign class, with a support of 0.622, a confidence of 0.986 and a lift of 1.502. Also, the rule (if A1 < 1.52 and A2 ≥ 13.41 and A3 < 2.68 then non-window-glass) is obtained with a support of 0.12, the highest lift (4.04) and a confidence of 0.96. This rule is more interesting even though it has a lower support. This result demonstrates the usefulness of the lift measure for overcoming the imbalanced data problem by selecting interesting rules that cover the minority class. Our approach also finds the most general rule, if A4 < 1.20 then setosa, which correctly classifies all 50 examples of the setosa class with the highest fidelity of 100%. This rule has a support of 0.33, a confidence of 0.88 and a lift of 2.63. Table 6 shows the experimental results in terms of coverage, fidelity and complexity for the five datasets. The average and standard deviation were computed from 10-fold cross-validation. The standard deviation (std) measures the dispersion of the rule sets: a low std indicates that the measured values are tightly grouped around the average. Analysing the results in Table 6, we notice that our algorithm obtains a small number of rules on all datasets. For the Breast Cancer and Glass datasets, a well-balanced trade-off between fidelity and coverage is achieved. A good trade-off between coverage and the number of extracted rules is also achieved for the Austra dataset. This shows that our proposal provides rules with high fidelity, high accuracy and high comprehensibility (reduced complexity) on the Breast Cancer, Austra and Glass datasets. For the Iris dataset, with minfidelity = 0.90 and mincoverage = 0.8, our proposal provides the same rule set (4 rules) after each run (std = 0), with a fidelity of 93.33% and a coverage of 88.89%, which means that the program is stable in this region.
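The reported lift values can be cross-checked against the class sizes of Table 2 with a few lines of Python. The check assumes the standard definition lift = confidence / P(class), and it assumes that non-window glass is the 51-pattern class; the helper name `lift` is hypothetical. Small differences from the reported figures are expected because the confidences are given to only two or three decimals.

```python
def lift(confidence, class_count, total):
    """Lift of a rule, assuming lift = confidence / P(class)."""
    return confidence / (class_count / total)


benign = lift(0.986, 453, 690)   # benign class: 453 of 690 patterns
glass = lift(0.96, 51, 214)      # non-window glass: 51 of 214 (assumed)
setosa = lift(0.88, 50, 150)     # setosa: 50 of 150 patterns
benign_support = 429 / 690       # 429 benign patterns satisfy the rule

print(round(benign, 3), round(glass, 3), round(setosa, 3))
print(round(benign_support, 3))
```

The recomputed values agree with the reported lifts of 1.502, 4.04 and 2.63 to within rounding, and the rule for the small non-window-glass class obtains the largest lift because its class prior is the smallest.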


We can also see that a simple encoding of a rule set (a bit string) is sufficient to produce accurate rule sets on the majority of the datasets. The results should be improved, especially for complex real data (Pima and Iris), by introducing more elaborate schemes for producing discrete attributes. The extracted rules are accurate and comprehensible, and they mimic the knowledge contained in the neural network. To conclude the analysis, thanks to the filtering phase, which looks for reliable relations between the premises and the consequent by calculating the support, the confidence and the lift, our algorithm extracts rules with few conditions. Nevertheless, the rules are interesting: they exceed the minimum support threshold (set to 0.1) and the minimum confidence threshold (set to 0.5), and they have a lift greater than one.

Table 6 The experimental results in terms of fidelity, coverage and complexity using 10-fold cross-validation

Dataset   Fidelity        Coverage        Nrules        Time (sec)           minfidelity/mincoverage/rule limit
Iris      93.33±0.0000    88.89±0.0000    4.00±0.0000   622.7011±23.1828     0.90/0.80/5
Cancer    99.94±0.0016    98.64±0.0045    4.17±0.6771   522.2516±20.9273     0.98/0.98/5
Austra    96.52±0.0026    97.80±0.0100    3.67±0.2582   713.6921±121.4064    0.95/0.95/5
Pima      80.07±0.0259    82.50±0.0430    5.00±0.3873   531.2846±9.5527      0.75/0.75/8
Glass     99.51±0.0078    98.65±0.0061    3.90±0.3689   412.0493±14.7778     0.97/0.97/5

4.4. Comparison with other works

In this section, we compare the performance of our algorithm with five rule extraction algorithms: GEX [18], REX_Pitt [46], RxREN [24], RX+CGA [25] and HNFB-1 [47].


RxREN starts by pruning the insignificant input neurons of the trained neural network, analyzing the error that occurs when each neuron is removed. The second phase constructs the classification rules for each class from the significant neurons. The comprehensibility of the rules can be improved by removing the conditions whose removal does not affect the accuracy of a rule. For the Breast Cancer dataset, RxREN finds a reduced network with the architecture 6–3–2 and a classification accuracy of 96.8%, and extracts two rules (one rule for the benign class plus a default rule) from the pruned network. RxREN does not extract any symbolic interpretation of the malignant class. Likewise, for the Pima and Iris datasets, the authors used the default rule to represent the last class.


GEX uses a global approach based on a genetic algorithm to extract crisp rules from artificial neural networks. It uses a special encoding for rule representation and a special mutation operator that depends on the type of the input variable. To evaluate rules, GEX uses a weighted average of several criteria, such as coverage and comprehensibility. The authors state that some parameters, such as the number of individuals and the number of generations, need to be tuned for a given dataset, which can be perceived as a drawback of the method.


REX_Pitt uses a global approach based on a genetic algorithm to extract fuzzy rules from artificial neural networks. It uses the Pitt approach to encode rules, so that one chromosome encodes a set of fuzzy rules. Like GEX, REX_Pitt uses a weighted average of several measures, such as the ratio between correctly and incorrectly classified patterns, the number of premises and the number of rules. The algorithm does not ensure that the number of rules is optimized.

Gonçalves et al. [47] proposed a neuro-fuzzy model called HNFB-1 (inverted hierarchical neuro-fuzzy BSP system) dedicated to pattern classification and rule extraction. It allows fuzzy classification rules to be extracted from databases. According to the authors, HNFB-1 converged in less than one minute for all the databases. The disadvantage of this approach is that HNFB-1 does not take into consideration any dependency or correlation that might exist between variables.

First, notice that our proposal extracts fewer rules than most algorithms (see Tables 7-9). Table 7 shows that, except for Pima, our proposed algorithm achieves a better trade-off between fidelity, coverage and number of rules than GEX. Focusing on the coverage and the number of rules (here rule limit = 3), a comparison between RxREN, RX+CGA, HNFB-1 and our proposal is performed (see Table 8). It should be noted that the RX+CGA results presented in Table 8 are taken from [24]. Except for the Iris dataset, our proposal provides a good compromise between accuracy and comprehensibility. It is important to notice that our approach is more effective than RxREN and produces rules for all classes without requiring a default rule. Compared with REX_Pitt, our proposal achieves a better trade-off between fidelity and the number of extracted rules on the Iris dataset (see Table 9). The results reveal how well the rules mimic the behaviour of the neural network. Finally, it is important to notice that our approach is more effective than the other methods and produces rules with various properties by taking into account not only fidelity, coverage and complexity but also support, confidence and lift; this makes the rules more reliable, interesting and understandable for humans.

4.4.1. Statistical comparisons of algorithms

As can be seen in Tables 7 and 8, the algorithms are compared according to several measures (fidelity, comprehensibility and accuracy). In this section, statistical comparisons of the algorithms over various datasets are performed by combining the measures into a single one that we call performance. We assume that each measure has the same weight (1/M), where M is the number of measures. It should be noted that the values of comprehensibility were normalized before processing.

$$Performance = \frac{1}{M} \sum_{i=1}^{M} Measure_i$$

We used the TTEST2 MATLAB function, available in the MATLAB Statistics Toolbox, to test the significance of the proposed method over the other algorithms according to the performance values. The TTEST2 function compares two algorithms at a time over multiple datasets. It performs a two-sample t-test at the chosen significance level (here alpha = 0.05) against the alternative specified by the tail argument. The null hypothesis is that the means are equal.
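For readers without MATLAB, the test statistic of a pooled (equal-variance) two-sample t-test can be sketched with the Python standard library. This is an illustration rather than a reproduction of the experiments: the function names `performance` and `pooled_t` and the sample values are hypothetical, and the equal-variance form is assumed because it yields the 2N-2 degrees of freedom used in the comparisons.

```python
from math import sqrt
from statistics import mean, variance


def performance(measures):
    """Equal-weight average of M measures (each weighted 1/M)."""
    return sum(measures) / len(measures)


def pooled_t(sample_a, sample_b):
    """Two-sample t statistic with pooled variance, and its degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(sample_a)
                  + (nb - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Hypothetical performance values of two algorithms over four runs each.
ours = [0.9, 1.0, 1.1, 1.0]
other = [0.5, 0.6, 0.4, 0.5]
t_stat, df = pooled_t(ours, other)
print(round(t_stat, 3), df)
```

For a right-tail test, the null hypothesis is rejected when the statistic exceeds the critical value of the t distribution for the given degrees of freedom and alpha, which is the comparison carried out with t_crit in this section.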

• The proposed algorithm versus GEX

As can be seen in Table 7, the results are obtained using 10-fold cross-validation (10 runs), which gives a sample size of N = 30 over the three datasets (10 for Breast Cancer, 10 for Austra and 10 for Pima).

H0 = Null hypothesis: the means are equal.
H1 = Alternative hypothesis: the mean of the proposal is greater than the mean of GEX (right-tail test).

The results of the statistical test are given in Table 10. The value of the test statistic (tstat) calculated by the ttest2 function is 5.95, with Df = 58 degrees of freedom (2N-2). The calculated tstat is then compared with the critical value (t_crit) obtained from the table of critical values (here t_crit = 1.67). As the calculated tstat is greater than the critical value at an alpha level of 0.05, we reject the null hypothesis, i.e., the proposed algorithm is significantly better than GEX. The same decision can be reached by comparing the p-value with alpha: the p-value (8.33e-008) is very small compared with alpha (0.05), so we reject the null hypothesis.

• The proposed algorithm versus RX+CGA

H0 = Null hypothesis: the means are equal.
H1 = Alternative hypothesis: the mean of the proposal is greater than the mean of RX+CGA.

As can be seen in Table 10, the p-value (0.0027) is less than alpha (0.05), so we reject the null hypothesis, i.e., the proposed algorithm is significantly better than RX+CGA.

• The proposed algorithm versus HNFB-1

H0 = Null hypothesis: the means are equal.
H1 = Alternative hypothesis: the mean of the proposal is greater than the mean of HNFB-1.

The p-value (1.46e-007) is less than alpha (0.05), so we reject the null hypothesis, i.e., the proposed algorithm is significantly better than HNFB-1.

• The proposed algorithm versus RxREN

As mentioned above, RxREN uses the default rule to represent the last class, which does not constitute a symbolic interpretation of that class, so it is difficult to make the comparison.

Table 7 Comparison with GEX performance in terms of number of rules, fidelity and coverage for Breast cancer, Austra and Pima datasets

Datasets        Performance measure   GEX            Our approach
Breast Cancer   Fidelity              98.20±0.014    99.94±0.0016
                Coverage              98.10±0.02     98.64±0.0045
                Nrules                19.64±2.33     4.17±0.6771
Austra          Fidelity              89.90±0.045    96.52±0.0026
                Coverage              64.30±0.056    97.80±0.0100
                Nrules                67.52±4.98     3.67±0.2582
Pima            Fidelity              97.00±0.022    80.07±0.0259
                Coverage              88.90±0.062    82.50±0.0430
                Nrules                27.80±3.56     5.00±0.3873

Table 8 Comparison with RxREN, RX+CGA and HNFB-1 performance in terms of number of rules and coverage for Breast cancer, Pima and Iris datasets

Datasets        RxREN            RX+CGA           HNFB-1           Our approach
                Acc    #rules    Acc    #rules    Acc    #rules    Acc    #rules
Breast Cancer   96.4   2.0       97.08  3.0       -      -         98.55  3
Pima            77.2   2.0       70.13  5.0       78.26  55        81.19  3
Iris            97.3   3.0       98     3.0       98.67  19        95.56  3

Table 9 Comparison with REX_PITT performance in terms of fidelity and number of rules for Iris dataset

Dataset   REX_PITT              Our approach
          Fidelity   #rules     Fidelity   #rules
Iris      98.67      3.8        100        3

Table 10 Statistical comparisons of the proposal with GEX, CGA and HNFB-1

         GEX         CGA      HNFB-1
P        8.33e-008   0.0027   1.46e-007
Tstat    5.95        2.89     6.21
Df       58          58       38
N        30          30       20

5. Conclusion

In this paper, we proposed a new, simple and efficient algorithm, based on multiobjective genetic algorithms, for extracting highly accurate and understandable rules from trained neural networks. The algorithm can be applied to datasets of any size, number of classes and number of attributes. It can also be used to extract rules directly from data. The rule filtering phase avoids misleading rules by taking into account the support, confidence and lift measures. As a result, fewer rules with fewer conditions (premises) are generated from the ANN. The lift makes it possible to extract interesting rare rules (with a lower support) in the case of unbalanced datasets such as Pima and Glass. The results show that the proposal provides the best trade-off between fidelity, coverage and complexity when compared with other works. The results should be improved, especially for complex real data, by introducing more elaborate schemes for producing discrete attributes. The question that remains unanswered is how far neural networks can help us, if not to formalize knowledge, at least to acquire it from data, especially in the area of big data, by using rule extraction to improve the comprehensibility of predictive models.

Conflict of Interest


The authors declare that they have no conflict of interest.

References

[1] M. D. Pandya, R. Patel Jay, A Survey: Artificial Neural Network for Character Recognition, Proceedings of Fifth International Conference on Soft Computing for Problem Solving, Volume 436 of the series Advances in Intelligent Systems and Computing, Springer, Singapore, pp. 403-410, 2016.
[2] A. Boutorh, A. Guessoum, Complex diseases SNP selection and classification by hybrid Association Rule Mining and Artificial Neural Network based Evolutionary Algorithms, Engineering Applications of Artificial Intelligence 51 (2016) 58–70.
[3] Y. Hayashi, R. Setiono, A. Azcarraga, Neural network training and rule extraction with augmented discretized input, Neurocomputing 207 (2016) 610–622.
[4] M. A. Augasta, M. A. T. Kathirvalavakumar, Rule extraction from neural networks - a comparative study, in: Proceedings of the International Conference on Pattern Recognition, Informatics and Medical Engineering, pp. 404–408 (2012).
[5] I. Khan, A. Kulkarni, Knowledge extraction from survey data using neural networks, Proc. Comput. Sci. 23 (2013) 433–438.
[6] H. Fernando, B. Surgenor, An unsupervised artificial neural network versus a rule-based approach for fault detection and identification in an automated assembly machine, Robotics and Computer-Integrated Manufacturing 43 (2017) 79–88.
[7] G. Zeng, H. Huang, X. Pei, S.C. Wong, M. Gao, Rule extraction from an optimized neural network for traffic crash frequency modeling, Accident Analysis and Prevention 97 (2016) 87–95.
[8] H. Etemadi, A. Ahmadpour, S. M. Moshashaei, Earnings Per Share Forecast Using Extracted Rules from Trained Neural Network by Genetic Algorithm, Comput Econ (2015) 46:55–63.
[9] D. Yedjour, H. Yedjour, A. Benyettou, Combining Quine Mc-Cluskey and Genetic Algorithms for Extracting Rules from Trained Neural Networks, Asian Journal of Applied Sciences 4(1):72-80, 2011.
[10] M. Craven, J. Shavlik, Extracting tree-structured representations of trained networks. Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA, 8: 24-30, 1996.
[11] B. Dhanalaxmi, G. A. Naidu, K. Anuradha, Adaptive PSO based Association Rule Mining Technique for Software Defect Classification using ANN, Procedia Computer Science 46 (2015) 432–442.
[12] S. Shinde, U. Kulkarni, Extracting classification rules from modified fuzzy min–max neural network for data with mixed attributes, Applied Soft Computing 40 (2016) 364–378.
[13] M. Craven, J. Shavlik, Rule Extraction: Where Do We Go from Here?, University of Wisconsin Machine Learning Research Group Working Paper 99-1, 1999.
[14] ZH. Zhou, Rule extraction: Using neural networks or for neural networks?. Journal of Computer Science and Technology 19(2):249-253, 2004.
[15] G. Towell, J. W. Shavlik, The Extraction of Refined Rules from Knowledge-Based Neural Networks. Machine Learning 13(1) (1993) 71-101.
[16] I. Taha, J. Ghosh, Symbolic interpretation of artificial neural networks, IEEE Trans. Knowl. Data Eng. 11(3): 448–463, 1999.
[17] S. Thrun, Extracting Rules from Artificial Neural Networks with Distributed Representations. In Advances in Neural Information Processing Systems, MIT Press, San Mateo, CA, 1995.
[18] U. Markowska-Kaczmar, Evolutionary Approaches to Rule Extraction from Neural Networks, Studies in Computational Intelligence (SCI) 82:177–209, 2008.
[19] R. Andrews, J. Diederich, A.B. Tickle, Survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge-Based Syst 8(6):373–389, 1995.
[20] E. Junqué de Fortuny, D. Martens, Active Learning-based Pedagogical Rule Extraction, IEEE Transactions on Neural Networks and Learning Systems, 2015.
[21] K. Saito, R. Nakano, Medical Diagnostic Expert System Based on PDP Model. In Proceedings of the IEEE (San Diego, CA, 1988), IEEE Press, 255-262.
[22] S.I. Gallant, Neural Networks Learning and Expert Systems, MIT Press, 1993.
[23] SN. Tran, A. S. d’Avila Garcez, Deep Logic Networks: Inserting and Extracting Knowledge From Deep Belief Networks, IEEE Transactions on Neural Networks and Learning Systems, 2016.
[24] M. Augasta, T. Kathirvalavakumar, Reverse Engineering the Neural Networks for Rule Extraction in Classification Problems, Neural Processing Letters 35(2):131-150, 2012.
[25] ER. Hruschka, NFF. Ebecken, Extracting rules from multilayer perceptrons in classification problems: a clustering-based approach. Neurocomputing 70 (2006) 384–397.
[26] JR. Zilke, EL. Mencía, F. Janssen, DeepRED – Rule Extraction from Deep Neural Networks, International Conference on Discovery Science, DS 2016: pp. 457-473.
[27] A. Bondarenko, L. Aleksejeva, V. Jumutc, A. Borisov, Classification Tree Extraction from Trained Artificial Neural Networks, Procedia Computer Science 104 (2017) 556–563.
[28] X. Fu, L. Wang, Rule extraction by genetic algorithms based on a simplified RBF neural network, In: Proceedings congress on evolutionary computation, pp. 753–758 (2001).
[29] U. Markowska-Kaczmar, K. Mularczyk, GA-Based Rule Extraction from Neural Networks for Approximation, In: Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 141–148 (2006).
[30] F. Ahmadizar, K. Soltanian, F. AkhlaghianTab, I. Tsoulos, Artificial neural network development by means of a novel combination of grammatical evolution and genetic algorithm. Eng. Appl. Artif. Intell. 39, 1–13, 2015.
[31] J.H. Moore, D.P. Hill, Epistasis analysis using artificial intelligence, Epistasis. Springer, New York, 2015, pp. 327–346.
[32] A. Konak, DW. Coit, AE. Smith, Multi-objective optimization using genetic algorithms: A tutorial, Reliability Engineering and System Safety, 91(9), 992-1007, 2006, doi:10.1016/j.ress.2005.11.018.
[33] E. Zitzler, L. Thiele, Multiobjective Evolutionary Algorithms: A Comparative Case Study and Strength Pareto Approach. IEEE Transactions on Evolutionary Computation 3(4): 257–271, 1999.
[34] M. Elarbi, S. Bechikh, L. Ben Said, R. Datta, Multi-objective Optimization: Classical and Evolutionary Approaches, Recent Advances in Evolutionary Multi-objective Optimization, Volume 20 of the series Adaptation, Learning, and Optimization, Springer, 2017, pp. 1-30.
[35] R. Agrawal, T. Imielinski, A. Swami, Mining Association Rules between Sets of Items in Large Databases. In: Proceedings of ACM SIGMOD international conference on management of data. Washington, DC, pp. 207–216, 1993.
[36] J.M. Luna, J.R. Romero, S. Ventura, On the adaptability of G3PARM to the extraction of rare association rules, Knowledge and Information Systems, 38(2): 391-418, 2014.
[37] W. Li, J. Han, J. Pei, CMAR: accurate and efficient classification based on multiple class association rule. In: Proceedings of the ICDM’01, San Jose, CA, pp. 369–376, 2001.
[38] B. Liu, W. Hsu, Y. Ma, Integrating classification and association rule mining. In: Proceedings of the KDD’98, New York, NY, pp. 80–86, 1998.
[39] M.M. Jahangir Kabir, S. Xu, B. Ho Kang, Z. Zhao, A New Evolutionary Algorithm for Extracting a Reduced Set of Interesting Association Rules, Neural Information Processing, Volume 9490 of the series Lecture Notes in Computer Science, pp. 133-142, 2015.
[40] F. Berzal, I. Blanco, D. Sánchez, MA. Vila, Measuring the accuracy and interest of association rules: a new framework. Journal Intelligent Data Analysis. IOS Press 6(3):221–235, 2002.
[41] C. Tew, C. Giraud-Carrier, K. Tanner, S. Burton, Behavior-based clustering and analysis of interestingness measures for association rule mining, Data Min Knowl Disc (2014) 28:1004–1045.
[42] M. Karabatak, M.C. Ince, An expert system for detection of breast cancer based on association rules and neural network. Expert Syst. Appl. 36(2), 3465–3469, 2009.
[43] S. Jena, P. Patro, S S. Behera, Multiobjective Optimization of Design Parameters of a Shell & Tube type Heat Exchanger using Genetic Algorithm, International Journal of Current Engineering and Technology 3(4):1379-1386, 2013.
[44] K. Deb, Multiobjective Optimization Using Evolutionary Algorithms. John Wiley & Sons, 2001, ISBN 047187339.
[45] C.C. Blake, C. Merz, UCI Repository of Machine Learning Databases, (1998) University of California, Irvine, Dept. of Information and Computer Sciences.
[46] U. Markowska-Kaczmar, W. Trelak, Fuzzy logic and evolutionary algorithm - two techniques in rule extraction from neural networks, Neurocomputing, Volume 63, pp. 359-379, 2005.
[47] LB. Gonçalves, MM. Bernardes, R. Vellasco, Inverted hierarchical neuro-fuzzy BSP system: A novel neuro-fuzzy model for pattern classification and rule extraction in databases, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 36(2): 236-248, March 2006.

Authors’ information


D. Yedjour was born in Algeria. She received her Engineering Degree from the University of Oran, and her MSc and PhD in computer science from the University of Sciences and Technology of Oran (USTO). Her research interests focus on neural networks, data mining and machine learning.


A. Benyettou was born in 1955 in Algeria. He received his Engineering Diploma in 1982 from the Institute of Telecommunications of Oran and his MSc in 1986 from the University of Sciences and Technology of Oran (USTO), Algeria. In 1987, he joined the Computer Sciences Research Center of Nancy, France, where he worked until 1991 on Arabic speech recognition by expert systems (ARABEX). He received his PhD in electrical engineering in 1993 from USTO. He has been a professor at USTO since 2003 and research director of the Signal-Speech-Image (SIMPA) Laboratory since 2002. His current research interests are in the areas of pattern recognition, artificial intelligence and neurocomputing.