Associative classification based on the Transferable Belief Model
Francisco Guil, Department of Informatics, High Engineering School, University of Almería, Spain
Knowledge-Based Systems 182 (2019) 104800

Article history: Received 22 June 2018; Received in revised form 5 June 2019; Accepted 7 June 2019; Available online 10 June 2019

Keywords: Data mining; Associative classification; Transferable Belief Model; Pignistic transformation

Abstract

Associative classification is, in essence, a frequent itemset mining technique designed specifically for classification tasks. It can be viewed as a second-order data mining problem, where the aim is to obtain an accurate classifier taking the mined set formed by class association rules (CARs) as raw material; that is, association rules in which only the class attribute is considered in the rule's consequent. Traditionally, the classifier is obtained from the initial CAR-set by applying heuristic methods for sorting and pruning, leading to a (relatively small) set of classification rules. In this paper, instead of using heuristic-based methods, we propose the use of the Transferable Belief Model for obtaining an understandable, accurate and compact classifier composed of (pignistic) probability functions that summarize the huge mined set of rules. The experimental results obtained with benchmark datasets show the effectiveness of our promising proposal.

1. Introduction

Classification and Frequent Itemset Mining (FIM) are two important tasks belonging to the data mining area. Although they were conceived as solutions for different sorts of problems, their integration was proposed early on due to the inherent nature of association rules and their suitability for building accurate classifiers. A proposal for integrating both tasks first appeared in Liu et al. [1], where the authors presented an algorithm, called CBA (Classification Based on Association), consisting of two steps: (1) generating all the class association rules (the CAR-set) as a specific subset of the whole set of association rules; and (2) building a classifier from the generated set by pruning and sorting the rules it contains. The generation of the set of association rules from a dataset follows a FIM approach [2]. The general expression of an association rule is a pattern X → Y, where X and Y are frequent itemsets and X ∩ Y = ∅. However, if the dataset contains a multi-label class attribute (multi-class) then it is possible, and quite interesting, to extract only a significant subset of association rules of the form X → c, where X is a frequent itemset and c is a class value. Support and confidence are two frequentist measures for assessing the "interestingness" of a rule. Depending on their values, which must be greater than or equal to two user-defined parameters, the size of the mined CAR-set will vary

significantly, but generally results in a huge number of rules. So, syntactic simplicity and great expressive power contrast with size. This is one of the main issues related to the FIM area. To deal with this problem, in [1] the authors proposed a heuristic method for sorting and pruning the original CAR-set, resulting in a (relatively) small set of classification rules forming an accurate associative classifier. This paper [1] was the source of myriad references belonging to the emerging associative classification area. These references focus on one of two main aims: accuracy or execution time. The most outstanding references found in the literature are the algorithms MCAR [3], MAC [4], MLRP [5], CAR-Miner [6], FACA [7], ACPRISM [8], and PCAR [9], amongst others.
Adopting a different approach, in this paper we propose the use of a formal method for obtaining an accurate classifier from the mined CAR-set. Considering the CAR-set as a set of belief bodies, each one associated with a class label, our method obtains a set of probability functions that compactly summarizes the regularities present in the original dataset. Each probability function is defined over the frequent items (literals) of the input dataset and behaves as a covering function, taking into account both the frequentist measures and the structure of the rules. Intuitively, while the confidence or the length of a rule is just a descriptive value, the probability function is a valid and informative value for decision making, as it is calculated bearing in mind the frequentist measures and the structure of each rule, and all the rules that have items in common. The underlying idea is that, if we want to obtain a classifier based on the information provided by the CAR-set, it will be more consistent if the related uncertainty can be described by probability functions. To obtain the covering probability functions from the CAR-set, we propose using the pignistic transformation defined in the Transferable Belief Model (TBM).


The TBM is a model for the representation of quantified beliefs held by an agent. It is a two-level model: the credal level, where beliefs are held and represented by belief functions, and the pignistic level, where decisions are made by maximizing expected utilities. In order to compute these expectations, it is necessary to build a probability measure at the pignistic level. This probability measure is called the pignistic probability and denoted as BetP [10]. In our approach, the agent is the data mining algorithm and the beliefs are represented, precisely, by the set of class association rules, characterized by their structure and frequentist measures. The set of derived BetP functions will form the core of a pignistic classifier capable of making accurate class predictions for any query object (or test instance).
Our approach belongs to the area of classification using belief functions, which has been the object of research for years. In 2006, Denoeux and Smets [11] reviewed, classified and presented a comparative analysis of the two main families of approaches, model-based [12] and case-based [13,14] classifiers, both applying the Transferable Belief Model to classification problems. In [15] the authors introduced a hybrid (hard, fuzzy or credal) classification method to capture imprecise information using the set of classes. A weighted evidential classifier combination method has recently been developed [16] to manage uncertainty and improve classification accuracy. However, at present, making decisions with belief functions is still an open problem, with two main unresolved issues; namely, the definition of mass functions (basic probability assignments) [17,18] and the decision-making resulting from them [19]. As the author asserts in [19], the pignistic criterion is a belief-function approach to decision-making, where decisions are made by maximizing expected utilities with respect to a probability measure derived from a belief function. As the author remarked in [13], there is no contradiction between the choice of belief functions to represent uncertainty and the introduction of a probability distribution to allow decision-making.
In this context, we want to highlight the paper by [17], in which the authors presented a statistical method to obtain basic probability assignments for classification problems. After deriving a probability model for each attribute (a Gaussian process), they intersect these probability models with query objects, obtaining a mass function for each attribute, which requires a later combination phase to obtain the final basic probability assignment. This function is finally transformed into a pignistic probability distribution for decision-making. Although interesting for its experimental nature, this complex global process is computationally intensive, and the theoretical model fits only by assuming several requirements and by assigning values to a large set of user-defined parameters. A different proposal for building a basic probability assignment is presented in [20], constructing the classifier not from data, but from a set of mined class association rules. By using an efficient pattern mining algorithm with two user-defined parameters, the authors proposed a classification method under the assumption that every pattern is important, and none can be overlooked.
So, the result is a classifier formed by the whole set of class association rules and a heuristic-heavy method to infer new knowledge from it. In this case, evidence theory is used to combine weighted evidence bodies derived from the set of matching rules. Thus, making a single decision involves a global process for obtaining a set of matching rules, deriving a combined basic probability assignment and, finally, selecting the maximal belief as the classification result. In accordance with the idea of using mined patterns as raw material, our approach reinterprets how belief functions are used in associative classification by summarizing the whole set of class association rules into pignistic probability functions following the

principles of the TBM. The main result of our approach is that one can build a compact, light-weight and accurate classifier managed by a simple inference method, bearing in mind the computational requirements of modern knowledge-based systems. In order to demonstrate the method's usefulness, we have carried out a series of experiments to show how our method behaves with 12 benchmark, 2-class datasets taken from the UCI Machine Learning Database Repository. To obtain the input class association rules, we have used the Apriori [21] mining algorithm implemented in the WEKA workbench [22]. This mining algorithm reduces the minimum support of the itemsets in an iterative way until it finds the required number of class association rules with the given minimum confidence.
The rest of this paper is organized as follows. Section 2 introduces the Associative Classification problem. Section 3 introduces the basis of the Transferable Belief Model. Section 4 presents the proposed method for summarizing class association rules into a pignistic classifier. Section 5 presents the basic algorithms and their application to a case study. Section 6 describes an empirical evaluation with dense benchmark datasets. Conclusions and future work are finally laid out in Section 7.

2. Associative classification

Associative Classification is an emerging area that combines association rule mining and (supervised) classification techniques. While the aim of classification problems is the prediction of class labels and the aim of association mining is the discovery of frequent associations between attribute values in a dataset, their combination leads to accurate classifiers composed of interesting rules that cannot be discovered by traditional classification algorithms [3]. The basic idea consists of extracting only a certain subset of the whole set of mined association rules, composed of a special type of rules known as class association rules. Let us define the associative classification problem in the FIM framework.
Let D be a dataset with n attributes, A = {A0, A1, ..., An−1}. Let C = {c0, c1, ..., cc−1} be the domain of the class attribute, where each ci is a class label. Each attribute Ai must be categorical, making it necessary to apply discretization techniques if any of the attributes is continuous. This is motivated by the constraint of Apriori-like algorithms, which cannot handle numeric attributes. So, each attribute Ai has an associated discrete domain.

Definition 1 (Item). Let us denote an item i as the pair formed by an attribute and a specific value associated with it, that is, i = (Ai, vji), where vji ∈ dom(Ai).

Definition 2 (Itemset). Let I be the set of all items in the dataset. A set X ⊆ I is called an itemset of length k, or k-itemset.

Definition 3 (Frequent Item). An item i is considered frequent iff its support value is greater than or equal to a user-defined parameter named minimum support and denoted by σs. The support of an item is defined as follows:

support(i) = fr(i) / |D|,   (1)

where fr(i) denotes the number of occurrences of the item in the dataset, and |D| is the total number of objects (records or examples) in the dataset.

Definition 4 (Set of Frequent Items). Let us denote If as the set of frequent items in D, that is, If = {i0, i1, ..., if−1}, where support(ij) ≥ σs, 0 ≤ j < f.
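As an illustration of Definitions 1-4, the following minimal Python sketch (ours, not part of the paper; the record encoding and the helper name frequent_items are assumptions) computes item supports over a categorical dataset and keeps the frequent ones:

from collections import Counter

def frequent_items(dataset, min_support):
    # Return {item: support} for the items meeting the minimum support (Defs. 1-4).
    # `dataset` is a list of records, each a dict {attribute: value};
    # an item is the (attribute, value) pair of Definition 1.
    n = len(dataset)
    counts = Counter((attr, val) for record in dataset for attr, val in record.items())
    return {item: freq / n for item, freq in counts.items() if freq / n >= min_support}

# Toy usage on two Weather-like records
records = [{"Outlook": "sunny", "Windy": "false"},
           {"Outlook": "sunny", "Windy": "true"}]
print(frequent_items(records, min_support=0.5))
# {('Outlook', 'sunny'): 1.0, ('Windy', 'false'): 0.5, ('Windy', 'true'): 0.5}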


Definition 5 (Frequent Itemset). An itemset is considered frequent iff its support value is greater than or equal to a user-defined parameter named minimum support and denoted by σs.

Definition 6 (Frequent Itemset Mining). Given a dataset D, and the value of the user-defined parameter σs, the main goal of any Apriori-like mining algorithm is to find the set of all frequent itemsets in D.

Frequent Itemset Mining is an iterative process with levelwise generation, where itemsets of length k (k-itemsets) are generated at each level. The algorithm starts by generating a set of candidate itemsets of length 1, eliminating the non-frequent ones, and obtaining the frequent 1-itemsets, denoted as If. This set is used to generate frequent 2-itemsets, which in turn are used to generate 3-itemsets, and so on, until no candidate set of length k + 1 remains. In general, k-itemsets are generated right after the (k − 1)-itemsets have been generated. In each iteration, the algorithm uses the Apriori property (also known as downward closure), which states that if an itemset is not frequent then all of its super-itemsets must also be infrequent. This property is used to prune the search space, which becomes considerably reduced, making frequent itemset mining a computationally feasible problem.

Definition 7 (Association Rule). From the set of frequent itemsets, a set of association rules is derived of the form R : X → Y, where X, Y ⊆ If and X ∩ Y = ∅ [21]. An association rule R is considered interesting iff its confidence value is greater than or equal to another user-defined parameter named minimum confidence and denoted as σc, where:

confidence(R : X → Y) = support(X ∪ Y) / support(X)   (2)
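The levelwise process of Definition 6 can be sketched in a few lines of Python (an illustration under our own encoding of transactions as frozensets; it is not the optimized WEKA implementation used later in the paper):

from itertools import combinations

def apriori(transactions, min_support):
    # Levelwise frequent itemset mining (Definition 6), illustrative only.
    n = len(transactions)
    support = lambda itemset: sum(itemset <= t for t in transactions) / n

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {fs: support(fs) for fs in level}
    while level:
        # Join step: candidate (k+1)-itemsets from frequent k-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        # Prune step (downward closure): every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))}
        level = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in level})
    return frequent

The confidence of Eq. (2) then follows directly from the returned dictionary as support(X ∪ Y) / support(X).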

In our context, and bearing the classification problem in mind, the mining process is only focused on a special subset of association rules, named class association rules, the right-hand side of which is restricted to the class attribute [1].

Definition 8 (Class Association Rule). A class association rule is a rule of the form R : X → c, where X ⊆ If is a frequent itemset and c ∈ C. In this case, the support of the rule is calculated by the formula:

support(X → c) = fr(X ∪ c) / |D|,   (3)

and its confidence value is calculated via the formula:

confidence(X → c) = support(X → c) / support(X)   (4)

A CAR is considered frequent iff its confidence value is greater than or equal to the user-defined parameter σc.

Definition 9 (CAR-set Mining). Thus, given a dataset D and the values of the user-defined parameters σs and σc, the main goal of CAR-mining is to find a set, denoted as CAR-set, that contains all the frequent CARs in D; that is, CAR-set = {Ri : X → c | confidence(Ri) ≥ σc ∧ support(X) ≥ σs}, where Ri is a CAR.

Definition 10 (Associative Classifier). In general terms, an associative classifier is formed by the pair (pCAR-set, HM), where pCAR-set is a significant subset (pruned and sorted) of the CAR-set, and HM denotes a heuristic, rule-based method for making accurate predictions.

In [1], the authors presented a heuristic algorithm for classifying new instances (query objects) whose core is a sorting method that establishes a precedence degree between two rules based on three criteria: confidence, support and generation time.
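A possible sketch of Definition 9 in Python, built on top of the frequent itemsets returned by the previous sketch (the tuple layout of the CARs is our own choice, not the paper's):

def mine_cars(transactions, labels, frequent, min_conf):
    # Build the CAR-set {X -> c} from already-mined frequent itemsets (Definition 9).
    # `transactions` are frozensets of items, `labels` the parallel class values,
    # `frequent` the {itemset: support} dictionary of the previous sketch.
    n = len(transactions)
    cars = []
    for X, supp_X in frequent.items():
        for c in set(labels):
            # support(X -> c): fraction of records containing X and labelled c (Eq. (3))
            supp_rule = sum(X <= t and lab == c
                            for t, lab in zip(transactions, labels)) / n
            conf = supp_rule / supp_X              # Eq. (4)
            if conf >= min_conf:
                cars.append((X, c, supp_rule, conf))
    return cars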

Fig. 1. Associative classification mining.

In [3], the authors extend the precedence criteria, taking into account the length of the rules involved and the class distribution. Finally, in [9], the first criterion for sorting rules is their predictive rate, a frequentist value obtained in the cross-validation step. Moreover, the combinatorial nature of this type of rule implies that the discovery process leads to a huge number of association rules. As reported in [8], all existing CAR-mining algorithms inherit the problems of the FIM area; namely, massive numbers of discovered rules and a very large derived classifier. We would like to highlight the work presented in [5], where the authors proposed pruning redundant rules and took rule dependence into account for ranking CARs. But, as they reported, resolving the rule dependence problem is time-consuming, so they proposed instead a polynomial-time pruning algorithm that outperforms the L3 method presented in [23]. In [24] the authors present a review and comparison of the common approaches in the literature focused on this area.
Of the different steps in the AC mining process (summarized in Fig. 1), we focus mainly on the classifier building and evaluation steps. Given a query object and the CAR-set, a heuristic method is designed to assign the most appropriate class value (if any), taking the decision from the rule (or rules) that best matches the query object. Moreover, the decision is taken based on a ranking where confidence, support and length are taken into account. However, the question here is, why is confidence or support more important than length? In our opinion, these different descriptive values have the same importance and they must all be taken into account to obtain quantifiable decision values. For example, let us consider the query object Rq : a1, b3 → c?, and suppose that our classifier is composed of two rules, R1 : a0, a1, b0, b1, b3 → c1 and R2 : a0, a1, a2, b1, b3 → c2, where the confidence values of both rules are slightly different. Which is the best rule for assigning the class value to Rq? It is likely that different heuristic methods will take different decisions. Nevertheless, in our approach, the best rule does not exist; instead, each rule is a piece of valuable information that quantifies beliefs on a given frame of discernment. It is clear that uncertainty is inherent in this problem and, therefore, a possible solution could be based on managing this feature.


Therefore, we consider that a CAR-set transformation is needed to make decisions, and this transformation will be guided by confidence values and the inherent structure of the rules. In this way, we propose that all the rules with the same consequent be summarized into a probability function defined over the set of frequent items, where each probability value is computed taking into account all the rules in which the frequent item is present. So, rather than building a classifier formed by the pair of class rules and a heuristic-based method, our classifiers will simply consist of a set of probability functions, managed by a formal yet simple method to make decisions based on them. In addition, this proposal will be carried out on the basis of the Transferable Belief Model framework, which is presented in the next section.

3. The Transferable Belief Model

The TBM is a model for representing quantified beliefs based on belief functions [10]. It is an interpretation of the Dempster–Shafer model [25], a mathematical Theory of Evidence for modeling complex systems, where uncertainty is represented by quantified beliefs. The foundations of the TBM rely on belief functions, a more general system than probabilistic quantification. In this approach, the authors propose a theory based on the fact that there is a two-level mental model:

• Credal level, where beliefs are entertained, and represented by belief functions.

• Pignistic level, where beliefs are used to make decisions by maximizing expected utilities.

The Credal level (credal from "credo", a Latin word that means "I believe"). Belief measures are fuzzy measures assigned to propositions to express the uncertainty associated with them. Let us consider a k-element frame of discernment Ω = {ω0, ..., ωk−1}. A basic belief assignment (bba) is a mapping mΩ : 2^Ω → [0, 1] that satisfies Σ_{A⊆Ω} m(A) = 1. The bba values are called basic belief masses (bbm). If m(∅) = 0 we say that our bba is normalized. The TBM postulates that the impact of a piece of evidence on an agent is translated by an allocation of parts of an initial unitary amount of belief amongst the propositions of Ω. For A ⊆ Ω, m(A) is the part of the agent's belief that supports A. Every A ⊆ Ω such that m(A) > 0 is called a focal element. The main difference with probability models is that masses can be given to any proposition p of Ω, with p ∈ 2^Ω, instead of only to its atoms.
The Pignistic level (pignistic from "pignus", a Latin word that means "a bet"). With the evidence available to an agent, the TBM allows one to obtain a belief function that describes the credal state in the frame of discernment. If we want to make a decision based on this credal state, it is first necessary to find a rule that allows the construction of a probability distribution from a belief function. The authors first proposed a method based on the Insufficient Reason Principle: if we have to build a probability distribution on n elements, given a lack of information, we give a probability 1/n to each element. The probability distribution is obtained by distributing m(A) amongst the atoms of A, so that, for each ω ∈ Ω,

BetP(ω) = Σ_{A⊆Ω, ω∈A} m(A) / |A|,   (5)

where |A| is the number of atoms of Ω in A. BetP is, mathematically, a probability function, but the authors called it the Pignistic Probability Function to stress the fact that it is the probability function used in a decision context.

The name BetP was selected to highlight its real nature; namely, a probability measure for decision-making (for betting). Once BetP has been calculated for the atoms of Ω, then for each A ⊆ Ω,

BetP(A) = Σ_{ω∈A} BetP(ω)   (6)
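The pignistic transformation of Eqs. (5) and (6) is straightforward to implement; the following generic Python sketch (ours, for illustration only) works for any bba given as a dictionary of focal elements:

def pignistic(bba, frame):
    # Distribute each mass m(A) evenly over the atoms of A (Eq. (5)).
    betp = {w: 0.0 for w in frame}
    for focal, mass in bba.items():      # focal elements A with m(A) > 0
        for w in focal:
            betp[w] += mass / len(focal)
    return betp

def betp_of_set(betp, A):
    # BetP(A) as the sum of BetP over the atoms of A (Eq. (6)).
    return sum(betp[w] for w in A)

# Example: m({a}) = 0.4, m({a, b}) = 0.6  =>  BetP(a) = 0.7, BetP(b) = 0.3
print(pignistic({frozenset("a"): 0.4, frozenset("ab"): 0.6}, frame="ab"))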

Obtaining BetP from the basic belief masses is called the Pignistic Transformation. In [26], the author formalized the justification of the pignistic transformation based on a linearity requirement.

4. Problem definition

This section introduces the notation and basic definitions necessary for a detailed description of our proposal's objective of building an associative classifier under the TBM. The basic idea consists of considering the CAR-set as a set of bodies of evidence, each one associated with a class value. After splitting the CAR-set into c subsets, we normalize the confidence values, treating them as basic belief masses. Fig. 2 outlines the proposed method. Applying the pignistic transformation to each subset, we obtain c probability functions that summarize the regularities present in the global CAR-set.
Making a syntactic correspondence between Associative Classification and the Transferable Belief Model, the set of frequent items in D, If = {i0, i1, ..., if−1}, is the frame of discernment. The set of frequent itemsets is the starting point for the extraction of the whole set of CAR rules of the form Ri : X → c, where X ⊆ If and c ∈ C. For each c in C, we obtain a subset formed by all the rules with the same consequent. Each subset is formed by the set of rules R^c = {R^c_i : Xi → c}, characterized by its confidence values. R^c is a body of evidence, formed by the set of focal elements Xi, while the set of relative confidence values is the basic probability assignment; that is to say, m(Xi) = norm_confidence(R^c_i). Once this correspondence is established, we can compute its associated pignistic probability, which summarizes the information provided by each rule belonging to the body of evidence. The pignistic probability is defined over the set of frequent items, assigning a value that depends on the set of rules in which each item is present. Following the idea of the TBM, the confidence of a rule is distributed amongst the frequent items it contains. So, the pignistic value of a frequent item is computed taking into account the confidence and length of each rule in which it is present, always in the context of a particular body of evidence. In this context, the pignistic probability is called the pignistic pattern, or simply pattern, relative to a certain class label.
Before formally defining the pattern concept, let us introduce some basic notation and definitions. As we pointed out, let D be a dataset with n discrete attributes (A), C the class attribute, If = {i0, i1, ..., if−1} the set of frequent items in D, and let the CAR-set be the set composed of the mined class association rules. We split the CAR-set into c subsets, one set for each class value, so CAR-set = ∪ R^c, where R^c = {R^c_i : Xi → c}, Xi ⊆ If and c ∈ C. From each R^c we compute BetP^c : If → [0, 1] using the formula:

BetP^c(ij) = Σ_{R^c_i ∈ R^c, ij ∈ Xi} m(R^c_i) / |R^c_i|   (7)

Definition 11 (Pignistic Pattern). A pignistic pattern, denoted as BetP^c, is the pignistic probability computed from R^c, that is, a function If → [0, 1] that synthesizes the regularities associated with a specific class value and serves as the basis for making decisions; namely, for predicting the class value for any query object.
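For instance, take a hypothetical two-rule body of evidence (not taken from the case study below): if R^c contains R1 : {a, b} → c with normalized confidence 0.6 and R2 : {b} → c with normalized confidence 0.4, then Eq. (7) gives BetP^c(a) = 0.6/2 = 0.3 and BetP^c(b) = 0.6/2 + 0.4/1 = 0.7, so the pattern distributes the whole unit of mass over the frequent items appearing in the body of evidence.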


For a query object of the form Rq : X → c?, we can obtain its membership degree to a class value c just by adding up the pignistic probabilities of the items present in X.

Definition 12 (Membership Degree). Let Rq : X → c? be a query object, where X = {i0, i1, ..., ik} is a frequent k-itemset. The membership degree of Rq to the class c is computed by:

BetP^c(Rq) = Σ_{ij ∈ X} BetP^c(ij)   (8)

Definition 13 (Pignistic Classifier). A pignistic classifier is an associative classifier formed by the set of derived pignistic patterns related to each class value, and a simple method M to make predictions from it. So,

Classifier_{D, σs, σc} = ({BetP^{c0}, BetP^{c1}, ..., BetP^{c(c−1)}}, M)   (9)

Definition 14 (Making Predictions). Using this compact representation, we can predict (or estimate) the class value for any query object following the max rule:

M(Rq) = ci | BetP^{ci}(Rq) = max{BetP^{c0}(Rq), ..., BetP^{c(c−1)}(Rq)}   (10)

Fig. 2. Construction of a Pignistic Classifier.

Table 1
Weather dataset.

Outlook    Temperature   Humidity   Windy   Play
Sunny      85            85         False   No
Sunny      80            90         True    No
Overcast   83            86         False   Yes
Rainy      70            96         False   Yes
Rainy      68            80         False   Yes
Rainy      65            70         True    No
Overcast   64            65         True    Yes
Sunny      72            95         False   No
Sunny      69            70         False   Yes
Rainy      75            80         False   Yes
Sunny      75            70         True    Yes
Overcast   72            90         True    Yes
Overcast   81            75         False   Yes
Rainy      71            91         True    No

Table 2
CAR Mining parameters.

Parameter                  Value
Maximum number of rules    10 000
Lower Minimum support      0.01
Upper Minimum support      1.0
Minimum confidence         0.1

Algorithm 1 Main algorithm.
function main(Dataset, σs, σc, #rules) returns classifier
 1: Dataset.randomize()
 2: if Dataset.needDiscretization() then
 3:     Dataset.discretize()
 4: end if
 5: trainingData = Dataset.getTrainingData(67%)
 6: testData = Dataset.getTestData(33%)
 7: CAR-set = Apriori-mining(trainingData, σs, σc, #rules)
 8: classifier = buildClassifier(CAR-set)
 9: evaluate(classifier, testData)
10: return classifier
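The prediction method M of Definitions 12-14 reduces to a few lines; the following Python sketch (an illustration with our own function names) assumes each pignistic pattern is stored as a plain dictionary over the frequent items:

def membership(pattern, query_items):
    # Membership degree of a query object to one class (Eq. (8)).
    return sum(pattern.get(item, 0.0) for item in query_items)

def predict(patterns, query_items):
    # Max rule of Definition 14: choose the class with the largest BetP (Eq. (10)).
    return max(patterns, key=lambda c: membership(patterns[c], query_items))

# Hypothetical usage with two toy patterns over items i0..i2
patterns = {"yes": {"i0": 0.5, "i1": 0.5}, "no": {"i0": 0.2, "i2": 0.8}}
print(predict(patterns, {"i0", "i1"}))   # -> "yes" (membership 1.0 vs 0.2)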

5. Algorithms, method and case study

Having formally described the method, in this section we specify its algorithmic design and its application to a well-known benchmark dataset, named Weather, a toy dataset commonly used in the literature to explain the basis of newly proposed methods. Table 1 shows the instances (or examples) that comprise this dataset. In this problem, the key is to find a classifier that predicts whether our children can play or not (the class value) depending on 4 meteorological attributes. The dataset describes 14 examples, 9 positive (Play = yes) and 5 negative (Play = no). This is a typical binary-class dataset, where instances are classified into one of the two possible class values. For generating and evaluating the pignistic classifier, we split the dataset following the 2/3-1/3 rule, so that 9 instances are used to build the classifier and 5 are used to test it. Algorithm 1 outlines the main function.
For evaluating and characterizing the classifier, there are several methods in the Machine Learning area; in particular, the classical method that consists of splitting the dataset into training and test datasets, and the cross-validation method.

Although the number of folds varies depending on the problem, in [27,28] the authors use a 5-fold strategy because it is more appropriate when the datasets are imbalanced. Despite these peculiarities, in our experiments we use the classical method, splitting the original dataset D into two datasets, the training data (DTr) and the test data (DTe), with a 2/3 to 1/3 proportion, respectively. Furthermore, in Associative Classification [24], accuracy is the metric used for evaluation; that is, the proportion of true positives and true negatives. However, as is well known in Machine Learning, using only this measure is insufficient for correctly evaluating a classifier, because of its high sensitivity to the distribution of the class attribute. In our proposal, we use not only accuracy, but also the weighted average of precision and recall (combined into a single derived measure named the F-measure), and the well-known AUC measure.
The algorithms for randomizing the dataset, discretizing the continuous attributes (Temperature and Humidity), splitting the dataset into training and test data, and mining class association rules belong to the WEKA workbench. The discretization method is based on the MDL method presented in [29]. The mining algorithm is a particularly efficient version of the classical and well-known Apriori algorithm [21]. One interesting characteristic is that the algorithm iteratively reduces the minimum support until it finds the required number of rules (in this case also a user-defined parameter, denoted as #rules). In Table 2 we specify the values of the mining-algorithm parameters used in the experiments. With these values for the user-defined parameters, we obtain a total of 70 class association rules (39 for Play = yes, and 31 for Play = no).

This rule set is split according to class value (consequent), forming 2 knowledge bodies in the same frame of discernment, each characterized by its mass function obtained from the confidence measures. The frame of discernment is formed by the frequent items corresponding to the specified minimum support. In Table 3 we show the set If. We wish to highlight the discretization process carried out on the two numerical attributes, Humidity and Temperature, where the method split the domains into only 1 segment, labeled All. This may indicate that the attributes are not relevant for the classification task; although, as we said before, this is only a toy dataset and so no significant technical conclusion should be derived from it. Next, for each class value we obtain a pignistic function, the pignistic patterns BetP^{Play=yes} and BetP^{Play=no}, as components of the final classifier.
Based on Eq. (7), in Algorithm 2 we outline the process for building a classifier from the CAR-set. The classifier variable is a pair formed by a set of pignistic patterns and a simple method to make predictions from it (Definition 13). The set of patterns is represented as a 2-dimensional map-based structure, designed for storing BetP values for every pair formed by a frequent item and a class label. In linear time with respect to the size of the CAR-set, the algorithm computes pignistic values for each rule, distributing its mass amongst its atoms, that is to say, its frequent items.

Table 3
Frame of discernment.

Frequent item   Description
i0              Humidity = All
i1              Outlook = rainy
i2              Outlook = sunny
i3              Temperature = All
i4              Windy = false
i5              Windy = true
i6              Outlook = overcast

Table 4
Components of the classifier.

Frequent item   BetP^{Play=yes}   BetP^{Play=no}
i0              0.19516           0.19488
i1              0.09938           0.12302
i2              0.04041           0.19599
i3              0.19515           0.19488
i4              0.18708           0.08187
i5              0.08410           0.20936
i6              0.19872           0.0

Fig. 3. Graphical representation of the classifier.

Table 5
Test instances.

#Instance   Outlook    Temperature   Humidity   Windy   Play
01          Sunny      All           All        True    Yes
02          Rainy      All           All        False   Yes
03          Rainy      All           All        True    No
04          Rainy      All           All        False   Yes
05          Overcast   All           All        True    Yes

Table 6
Query objects: binary representation.

#Instance   i0   i1   i2   i3   i4   i5   i6
01          1    0    1    1    0    1    0
02          1    1    0    1    1    0    0
03          1    1    0    1    0    1    0
04          1    1    0    1    1    0    0
05          1    0    0    1    0    1    1

Algorithm 2 Building a classifier from the CAR-set.
function buildClassifier(CAR-set) returns classifier
 1: R^c = split(CAR-set)
 2: classifier = (∅, M)
 3: for all R^{ci} ∈ R^c do
 4:     m^{ci} = normalizeConfidence(R^{ci})
 5:     for all R^{ci}_j ∈ R^{ci} do
 6:         for all ik ∈ R^{ci}_j do
 7:             classifier.BetP[ik, ci] += m^{ci}(R^{ci}_j) / |R^{ci}_j|
 8:         end for
 9:     end for
10: end for
11: return classifier
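For readers who prefer runnable code, Algorithm 2 can be transcribed into Python as follows (a sketch under the assumption that each CAR is given as an (itemset, class, confidence) tuple; it is not the author's implementation):

from collections import defaultdict

def build_classifier(car_set):
    # Summarize a CAR-set into pignistic patterns (Algorithm 2 / Eq. (7)).
    by_class = defaultdict(list)                 # split the CAR-set by consequent
    for itemset, label, confidence in car_set:
        by_class[label].append((itemset, confidence))

    betp = defaultdict(lambda: defaultdict(float))   # betp[label][item]
    for label, rules in by_class.items():
        total_conf = sum(conf for _, conf in rules)
        for itemset, conf in rules:
            mass = conf / total_conf             # normalized confidence used as bbm
            for item in itemset:                 # distribute the mass over the rule's items
                betp[label][item] += mass / len(itemset)
    return betp

Each inner dictionary plays the role of one column (one pignistic pattern) of the 2D map-based structure described above.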

Following on with the Weather dataset, in Table 4 we show the classifier components returned by Algorithm 2. In this case, there are two pignistic patterns, each describing the regularities present in one of the class labels, Play = yes and Play = no. As we can see, for each frequent item the classifier stores the bet value associated with each class label. This information can be easily interpreted in Fig. 3, where the probability functions are graphically shown. It is clear that both functions draw two distinct patterns, with the differences in the frequent items i2, i4, i5 and i6 being the most significant. Although it might be a good idea to remove the frequent items that show no significant differences, in this paper we have considered the whole set, with each item contributing its mass to the global BetP functions.
As we said before, the patterns in the classifier are stored in a 2D matrix of dimension if × c, where if is the total number of frequent items and c is the number of class labels.

Therefore, when evaluating the classifier we use a test dataset, which can be viewed internally as a matrix of dimensions n × if, where n = |DTe|. So, the evaluation process consists of multiplying both matrices, obtaining an n × c matrix with the evaluation of each instance; that is to say, the membership degree of each test instance to each class value. The predicted class is the one with the maximum BetP value.
Following the case study, the 5 test instances (or query objects) are shown in Table 5. Table 6 shows the binary representation of the test set, indicating the presence or absence of each frequent item in the instances. For example, the item Outlook = sunny corresponds to the frequent item i2 (see Table 3), so column i2 is set to 1 only in row #01. Obviously, the class attribute is ignored in the evaluation phase; it is only used for comparison with the predicted class in order to compute performance values.
The multiplication of both matrices (represented in Tables 4 and 6) gives us the classifier prediction for each query object, obtaining a matrix in which each component indicates the membership degree of each query object to each class label (the bet). Then, from these values, the classifier infers the class label by maximizing expectations; that is, assigning the class label with the maximum value (see Definition 14).
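The matrix view of the evaluation step can be sketched with NumPy; using the rounded values of Tables 4 and 6 for the first two test instances, the scores approximately reproduce the first two rows of Table 7 (the array names are ours):

import numpy as np

# Rows of T: binary item vectors of test instances (n x |If|), as in Table 6.
T = np.array([[1, 0, 1, 1, 0, 1, 0],
              [1, 1, 0, 1, 1, 0, 0]])
# Rows of P: one pignistic pattern per class label (c x |If|), as in Table 4.
P = np.array([[0.19516, 0.09938, 0.04041, 0.19515, 0.18708, 0.08410, 0.19872],   # Play = yes
              [0.19488, 0.12302, 0.19599, 0.19488, 0.08187, 0.20936, 0.0]])      # Play = no

scores = T @ P.T                     # n x c matrix of membership degrees (Eq. (8))
predicted = scores.argmax(axis=1)    # index of the class with the maximum BetP
print(scores.round(4), predicted)    # instance 01 -> Play = no, instance 02 -> Play = yes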

In Table 7 we summarize the obtained results, showing the BetP value obtained for each class label (the result of multiplying both matrices) and the predicted class label, obtained just by assigning the class label with the maximum BetP value. As one can observe in Table 8, 4 out of 5 query objects are correctly classified, providing a good accuracy measure equal to 0.8. However, as we know, this value only shows us the number of correct predictions, as it does not take the possible skew in the class distribution into account. It is necessary to use other performance measures to characterize our method, such as the weighted averages of accuracy, recall, precision and the F-measure; or, as proposed in [30], the AUC (Area Under the receiver operating characteristic Curve) measure. These measures are derived from the confusion matrix shown in Table 9. For binary problems, where only two class labels exist (for simplicity, we will denote them as class+ and class−), the confusion matrix shows four categories: (1) True Positives (TP), the positive test instances correctly classified; (2) True Negatives (TN), the negative test instances correctly classified; (3) False Positives (FP), the negative test instances incorrectly classified as positive; and (4) False Negatives (FN), the positive test instances incorrectly classified as negative. In this binary classification problem, we can derive the following performance measures:

Precision = TP / (TP + FP)
TPrate = Recall = TP / (TP + FN)
FPrate = FP / (FP + TN)

And, from these measures, we obtain the combined F-measure and the AUC measure as follows:

F-measure = (2 × Precision × Recall) / (Precision + Recall)
AUC = (1 + TPrate − FPrate) / 2

Table 7
Predicted class label.

#Instance   BetP^{Play=yes}   BetP^{Play=no}   Predicted class
01          0.5158            0.7867           Play = no
02          0.6778            0.5947           Play = yes
03          0.5748            0.7137           Play = no
04          0.6778            0.5947           Play = yes
05          0.6741            0.5907           Play = yes

Table 8
Predicted vs real class labels.

#Instance   Predicted class   Real class (Play)
01          Play = no         Play = yes
02          Play = yes        Play = yes
03          Play = no         Play = no
04          Play = yes        Play = yes
05          Play = yes        Play = yes

Table 9
Confusion matrix.

                Predicted class+       Predicted class−
Real class+     True Positives (TP)    False Negatives (FN)
Real class−     False Positives (FP)   True Negatives (TN)

Table 10
Confusion matrix for Play = yes+ (and its symmetric for Play = no+).

             Play = yes   Play = no
Play = yes   3            1
Play = no    0            1

             Play = no    Play = yes
Play = no    1            0
Play = yes   1            3

Table 11
Evaluation of the associative classifier.

Pignistic classifier   Accuracy   Precision   Recall   F-measure   AUC
Play = yes             0.80       1.00        0.75     0.86        0.88
Play = no              0.80       0.50        1.00     0.67        0.88
Weighted average       0.80       0.90        0.80     0.82        0.88

Table 12
Results obtained with RIPPER.

RIPPER             Accuracy   Precision   Recall   F-measure   AUC
Play = yes         0.60       0.75        0.75     0.75        0.38
Play = no          0.60       0.00        0.00     0.00        0.38
Weighted average   0.60       0.60        0.60     0.60        0.38

Table 13
Results obtained with C4.5.

C4.5               Accuracy   Precision   Recall   F-measure   AUC
Play = yes         0.60       0.75        0.75     0.75        0.38
Play = no          0.60       0.00        0.00     0.00        0.38
Weighted average   0.60       0.60        0.60     0.60        0.38

For each class value we compute these measures, presenting the weighted average value taking the class distribution into account. Although this method is designed for 2-class problems, the extension to multi-class is straightforward: the class label under computation is taken as the positive class, while the negative class aggregates the remaining class values. Following our example, the confusion matrix interpretation for studying the class values Play = yes and Play = no is laid out in Table 10. From these confusion matrices, we obtain the performance measures shown in Table 11.
In Tables 12 and 13 we show the results obtained with other rule-based and tree-based classifiers, namely RIPPER [31] and Decision Trees (C4.5) [32], two well-known classifiers commonly referenced in the literature. The purpose of this comparative analysis is purely to demonstrate the competitiveness of the pignistic classifiers compared to other classical classifiers; the main aim, however, is to propose a formal method to summarize huge quantities of rules instead of using complex and time-consuming heuristic-based methods and, above all, to obtain compact and light-weight classifiers valid for use in new devices designed with low computational resources. In order to highlight this idea, we point out that, from a training dataset formed of 9 instances with 5 attributes (including the class value), we have obtained a total of 70 class association rules; these have been summarized into two pignistic probability functions defined over a set of 7 frequent items. Hence, our compact classifier is represented by a 7 × 2 matrix managed by a simple algorithm to make predictions. It is clear that, in this case study, our classifier outperforms the other classifiers in terms of performance, mainly because it has correctly captured the regularities of the minority class Play = no, which was ignored by the other two classifiers. This indicates that the method is appropriate for dealing with imbalanced datasets.
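The per-class figures of Tables 11-13 can be reproduced from a confusion matrix with the formulas above; a small Python sketch (our own helper, for illustration):

def class_metrics(tp, fn, fp, tn):
    # Precision, recall, F-measure and AUC for one class taken as positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    fp_rate   = fp / (fp + tn) if fp + tn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    auc = (1 + recall - fp_rate) / 2
    return precision, recall, f_measure, auc

# Table 10, Play = yes as the positive class: TP = 3, FN = 1, FP = 0, TN = 1
print([round(v, 2) for v in class_metrics(3, 1, 0, 1)])   # [1.0, 0.75, 0.86, 0.88]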


Table 14
Datasets.

Name       #Attributes   #Instances   #Class+   #Class−
Adult      15            48842        11687     37155
Blogger    6             100          32        68
Blood      5             748          570       178
Breast     10            699          458       241
Credit     16            690          383       307
Diabetes   9             768          500       268
Haberman   4             306          225       81
Liver      11            583          416       167
Labor      17            57           20        37
Mushroom   23            8124         4208      3916
Trains     33            10           5         5
Vote       17            435          267       168

Table 15
Number of rules generated by CBA and ACPRISM.

Datasets   # of generated rules
           CBA          ACPRISM
Breast     7420         138
Diabetes   4075         344
Mushroom   21101        32
Vote       1152406      21
Average    296250.50    133.75

6. Empirical evaluation

In order to validate the effectiveness of our proposal, we have conducted a series of experiments on 12 different binary-class datasets from the UCI data repository. Table 14 sums up the main characteristics of the datasets, including the number of attributes, the number of instances, and the distribution of each class label. The experimental conditions were the same for all the datasets, using the values of the user-defined parameters specified in Table 2. The results obtained are summarized in Table 16, showing the weighted average of the evaluation measures for the pignistic classifier and two well-known classification algorithms, RIPPER and C4.5.
On the one hand, it is clear that the relative importance of the precision and recall measures can differ slightly depending on the application domain. Therefore, it is difficult to choose a single metric for comparing the different algorithms in this context. Nonetheless, our paper focuses on the description of a method for obtaining associative classifiers, rather than on its application to a specific domain, so both precision and recall have the same weight in the analysis. For this reason, we base our analysis on the F-measure, which combines precision and recall into a single value. On the other hand, AUC is the most commonly used measure for assessing classifiers in class imbalance problems, as pointed out in [30]. Hence, we base our comparative analysis on both measures.
Furthermore, it is clear that the results obtained with the 3 classifiers do not present significant differences (performing the non-parametric Wilcoxon test, as proposed in [33]), and in several cases our pignistic classifier obtains better F-measure and AUC values; this indicates the effectiveness of our method in building accurate classifiers that summarize a huge quantity of class association rules. In particular, in Table 16 we can see that, when focusing on the AUC metric, Pignistic and C4.5 were the joint best classifiers (ranked as #1) with the same frequency (50%). Besides the numerical analysis, we believe that the competitive advantage of using the TBM is the adoption of a formal and simple method designed for managing the uncertainty inherently present in these types of problems.
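As an illustration of the statistical comparison mentioned above (a sketch, not the exact script used for the paper), the Wilcoxon signed-rank test of [33] can be applied to the paired AUC values of Table 16 with SciPy:

from scipy.stats import wilcoxon

# AUC per dataset, in the order of Table 16 (Adult ... Vote)
pignistic = [0.7527, 0.7024, 0.6058, 0.9739, 0.7852, 0.7873,
             0.6168, 0.7237, 0.9231, 0.8331, 1.0000, 0.8751]
c45       = [0.7746, 0.5952, 0.6500, 0.9248, 0.8317, 0.7236,
             0.6168, 0.5000, 1.0000, 1.0000, 0.7500, 0.9593]

stat, p_value = wilcoxon(pignistic, c45)   # paired, two-sided by default
print(p_value)   # a large p-value means no significant difference between the two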

Table 16
Empirical evaluation.

Dataset     Classifier   Accuracy   F-measure   AUC
Adult       Pignistic    0.6804     0.7032      0.7527
            RIPPER       0.8320     0.8260      0.7395
            C4.5         0.8645     0.8579      0.7746
Blogger     Pignistic    0.7576     0.7439      0.7024
            RIPPER       0.6364     0.5745      0.5357
            C4.5         0.6667     0.6405      0.5952
Blood       Pignistic    0.6113     0.6465      0.6058
            RIPPER       0.7368     0.7469      0.6500
            C4.5         0.7368     0.7469      0.6500
Breast      Pignistic    0.9740     0.9741      0.9739
            RIPPER       0.9654     0.9655      0.9641
            C4.5         0.9351     0.9350      0.9248
Credit      Pignistic    0.7895     0.7890      0.7852
            RIPPER       0.8377     0.8373      0.8336
            C4.5         0.8377     0.8368      0.8317
Diabetes    Pignistic    0.7826     0.7860      0.7873
            RIPPER       0.7747     0.7713      0.7438
            C4.5         0.7431     0.7434      0.7236
Haberman    Pignistic    0.7426     0.7384      0.6168
            RIPPER       0.7426     0.7384      0.6168
            C4.5         0.7426     0.7384      0.6168
Liver       Pignistic    0.6354     0.6644      0.7237
            RIPPER       0.7760     0.6782      0.5000
            C4.5         0.7760     0.6782      0.5000
Labor       Pignistic    0.8947     0.8979      0.9231
            RIPPER       0.8947     0.8979      0.9231
            C4.5         1.0000     1.0000      1.0000
Mushroom    Pignistic    0.8381     0.8359      0.8331
            RIPPER       1.0000     1.0000      1.0000
            C4.5         1.0000     1.0000      1.0000
Trains      Pignistic    1.0000     1.0000      1.0000
            RIPPER       0.6667     0.6667      0.7500
            C4.5         0.6667     0.6667      0.7500
Vote        Pignistic    0.8542     0.8562      0.8751
            RIPPER       0.9514     0.9516      0.9537
            C4.5         0.9583     0.9585      0.9593

Consequently, in all the studied cases, we obtain a classifier formed by 2 probability functions, which can be implemented as a 2 × if matrix (or map-based 2D structure), managed by a simple algorithm to predict the class value for any query object. To reinforce this idea, in [8] the authors presented an effective algorithm, named ACPRISM, designed to obtain associative classifiers from several datasets; they reported the number of generated rules compared to a previous method called CBA. In Table 15 we show an extract of the data related to the same studied datasets. The significant advantage of the ACPRISM algorithm over CBA is clear, taking the number of pruned and sorted rules as a comparative parameter. Moreover, using this comparative parameter, it is clear that the pignistic classifier significantly outperforms previous approaches. In the best case scenario, a complex data structure was needed to store an average of 133 rules, which required a complex heuristic-based algorithm to manage the structure for predicting the class value of any query object. In our case, it is only necessary to store a 2D matrix and implement a relatively simple prediction algorithm.

7. Conclusions and future work

In this paper, we have presented a mathematically based method for obtaining an associative classifier from the whole set of class association rules derived from a classification problem. Our method is based on the pignistic transformation proposed in the Transferable Belief Model, taking the CAR-set as a set of
bodies of evidence formed by rules with the same consequent (class label). The associative classifiers obtained, called pignistic classifiers, are formed by c probability functions defined over the set of frequent items, where c is the number of class labels present in the dataset. The experimental evaluation shows the efficacy of our proposal in terms of performance and efficiency; the main advantage is that we obtain a compact, light-weight, white-box classifier which needs little storage and few computational resources to make predictions. This is an important issue if we consider that modern knowledge-based systems are nowadays deployed on resource-constrained platforms such as smartphones, tablets and other embedded devices. Moreover, having a white-box classifier formed by pignistic patterns that summarize and describe the regularities present in each class population can provide useful explanations regarding the application domain; this can help users in the important task of knowledge discovery.
As future work, we propose two main lines of investigation. First, we have considered the whole set of class association rules, regardless of the possibility that redundant rules exist which may affect the final performance of the derived classifier. Consequently, we propose the use of classical pruning techniques and, especially, the extraction of a special type of rule based on emerging patterns [34]. Secondly, we have used a simple prediction method based on the max-sum formula. It would be worthwhile to compare this simple technique with other efficient inference techniques, particularly those focused on computing the distance of the query object from each pignistic pattern [35].

References

[1] B. Liu, W. Hsu, Y. Ma, Integrating classification and association rule mining, in: Proc. of the 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD'98), AAAI Press, 1998.
[2] C. Aggarwal, J. Han, Frequent Pattern Mining, Springer, 2014.
[3] F. Thabtah, P. Cowling, S. Hammoud, Improving rule sorting predictive accuracy and training time in associative classification, Expert Syst. Appl. 31 (2006) 414–426.
[4] N. Abdelhamid, A. Ayesh, F. Thabtah, S. Ahmadi, W. Hadi, MAC: A multiclass associative classification algorithm, Inf. Knowl. Manag. 11 (2) (2012) 1250011-1–1250011-10.
[5] C. Chen, R. Chiang, C. Lee, C. Chen, Improving the performance of association classifiers by rule priorization, Knowl.-Based Syst. 36 (2012) 59–67.
[6] L. Nguyen, B. Vo, T. Hong, H. Thanh, CAR-Miner: An efficient algorithm for mining class-association rules, Expert Syst. Appl. 40 (2013) 2305–2311.
[7] W. Hadi, F. Aburub, S. Alhawari, A new fast associative classification algorithm for detecting phishing websites, Appl. Soft Comput. 48 (2016) 729–734.
[8] W. Hadi, G. Issa, A. Ishtaiwi, ACPRISM: Associative classification based on prism algorithm, Inform. Sci. 417 (2017) 287–300.
[9] K. Song, K. Lee, Predictability-based collective class association rule mining, Expert Syst. Appl. 79 (2017) 1–7.
[10] P. Smets, R. Kennes, The transferable belief model, Artificial Intelligence 66 (1994) 191–234.


[11] T. Denoeux, P. Smets, Classification using belief functions: the relationship between the case-based and model-based approaches, IEEE Trans. Syst. Man Cybern. B 36 (6) (2006) 1395–1406.
[12] P. Smets, Belief functions: the disjunctive rule of combination and the generalized Bayesian theorem, Internat. J. Approx. Reason. 9 (1993) 1–35.
[13] T. Denoeux, Analysis of evidence-theoretic decision rules for pattern classification, Pattern Recognit. 30 (7) (1997) 1095–1107.
[14] T. Denoeux, A neural network classifier based on Dempster–Shafer theory, IEEE Trans. Syst. Man Cybern. A 30 (2) (2000) 131–150.
[15] Z. Liu, Q. Pan, J. Dezert, G. Mercier, Hybrid classification system for uncertain data, IEEE Trans. Syst. Man Cybern. 47 (10) (2016) 2783–2790.
[16] Z. Liu, Q. Pan, J. Dezert, A. Martin, Combination of classifiers with optimal weight based on evidential reasoning, IEEE Trans. Fuzzy Syst. 26 (3) (2017) 1217–1230.
[17] P. Xu, X. Su, S. Mahadevan, C. Li, Y. Deng, A non-parametric method to determine basic probability assignment for classification problems, Appl. Intell. 41 (2014) 681–693.
[18] Y. Zhou, X. Tao, L. Luan, Z. Wang, Safety justification of train movement dynamic processes using evidence theory and reference models, Knowl.-Based Syst. 139 (2018) 78–88.
[19] T. Denoeux, Decision-making with belief functions, Internat. J. Approx. Reason. 109 (2019) 87–110.
[20] Y. Liu, Y. Jiang, X. Liu, S. Yang, CSMC: A combination strategy for multi-class classification based on multiple association rules, Knowl.-Based Syst. 21 (2008) 786–793.
[21] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proc. of the 20th Int. Conf. on Very Large Data Bases (VLDB'94), Morgan Kaufmann, 1994.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. Witten, The WEKA data mining software: An update, SIGKDD Explor. 11 (1) (2009) 10–18.
[23] P. Garza, E. Baralis, A lazy approach to pruning classification rules, in: Proc. of the IEEE Int. Conf. on Data Mining, IEEE Computer Society, 2002.
[24] N. Abdelhamid, F. Thabtah, Associative classification approaches: Review and comparison, J. Inf. Knowl. Manag. 13 (3) (2014) 30.
[25] G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton, NJ, 1976.
[26] P. Smets, Decision making in the TBM: the necessity of the pignistic transformation, Internat. J. Approx. Reason. 38 (2005) 133–147.
[27] S. García, Z. Zhang, A. Altalhi, S. Alshomrani, F. Herrera, Dynamic ensemble selection for multi-class imbalanced datasets, Inform. Sci. 445 (2018) 22–37.
[28] J. Bi, C. Zhang, An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme, Knowl.-Based Syst. 158 (2018) 81–93.
[29] U. Fayyad, K. Irani, Multi-interval discretization of continuous valued attributes for classification learning, in: Proc. of the 13th International Joint Conference on Artificial Intelligence, 1993.
[30] O. Loyola-González, M. Medina-Pérez, J. Martínez-Trinidad, J. Carrasco-Ochoa, R. Monroy, M. García-Borroto, PBC4cip: A new contrast pattern-based classifier for class imbalance problems, Knowl.-Based Syst. 115 (2017) 100–109.
[31] W. Cohen, Fast effective rule induction, in: Proc. of the Twelfth Int. Conf. on Machine Learning, 1995.
[32] R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[33] J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[34] M. García-Borroto, J. Martínez-Trinidad, A survey of emerging patterns for supervised classification, Artificial Intelligence Review 42 (2014) 705–721.
[35] J. Lin, Divergence measures based on Shannon entropy, IEEE Trans. Inform. Theory 37 (1) (1991) 145–151.