Classification with test costs and background knowledge

Tomasz Łukaszewski∗, Szymon Wilk
Computer Science Department, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
∗ Corresponding author. Tel.: +48 616652920. E-mail address: [email protected] (T. Łukaszewski).

Article history: Received 30 April 2015; Revised 30 September 2015; Accepted 7 October 2015; Available online 22 October 2015

Keywords: Test costs; Levels of abstraction; Naïve Bayes classifier

Abstract

We propose a novel approach to the problem of classification with test costs, understood as the costs of obtaining attribute values of classified examples. Many existing approaches construct classifiers in order to control the tradeoff between test costs and the prediction accuracy (or misclassification costs). The aim of the proposed method is to reduce test costs while maintaining the prediction accuracy of a classifier. We assume that attribute values are represented at different levels of abstraction and model domain background knowledge. Our approach sequentially explores these levels during classification – in each iteration it selects and conducts a test that makes the representation of a classified example more precise (i.e., acquires an attribute value), invokes a naïve Bayes classifier for this new representation and checks the classifier's outcome to decide whether this iterative process can be stopped. The selection of the test in each iteration takes into account the possible improvement of the prediction accuracy and the cost of this test. We show that the prediction accuracy obtained for classified examples represented precisely (i.e., when all the tests have been conducted and all specific attribute values have been acquired) can be achieved for a much smaller number of tests (i.e., when not all specific attribute values have been acquired). Moreover, we show that without levels of abstraction and with uniform test costs our method can be used for selecting features and is competitive with popular feature selection schemes: filter and wrapper.

1. Introduction

One of the main tasks of machine learning is to build classifiers from available data. Constructed classifiers, after their evaluation, are applied in many real-world applications: in medical diagnosis, automated testing, robotics, industrial production processes and many other areas. The most commonly used evaluation criterion of a classifier is its prediction accuracy. The measure of the prediction accuracy of a classifier is often replaced by the measure of the misclassification costs of a classifier, because different errors may have different costs. On the other hand, more and more attention is paid to test costs, that is, the costs of obtaining attribute values (features) of classified examples. The cost associated with a feature can be related to different concepts: expenses, risks or computational costs [1]. In order to decrease the total cost of these tests we may reduce their number, allowing for missing attribute values in the representation of classified examples. However, missing values of relevant attributes in the representation of classified examples usually degrade the prediction accuracy of a classifier (or increase the misclassification costs of a classifier) [2,3].




Therefore, we have to decide which tests should be carried out in order to control the tradeoff between the cost of these tests and the accuracy of a classifier (or the tradeoff between test costs and misclassification costs of a classifier) [4,5]. However, in many real applications it is very difficult to evaluate misclassification costs. For example, in medical diagnosis, how much money should be assigned to a misclassification cost when a misclassification endangers a patient's life? In such cases, we should concentrate on the tradeoff between test costs and the accuracy of a classifier. The appropriate approach may be to reduce test costs while maintaining the prediction accuracy of a classifier. This goal may be achieved by cost-based feature selection methods [1]. Let us notice that standard feature selection methods were designed to handle plain data without any type of generalization of attribute values. However, there are areas where attribute values are represented at different levels of abstraction. These levels model domain background knowledge and usually have the form of a tree-like hierarchy. In such a tree the root represents a missing value, leaves represent specific attribute values, and the remaining nodes represent abstract attribute values (e.g., sets of specific values). Importantly, such hierarchies entail the existence of tests that replace a more abstract value by a less abstract value. Moreover, this replacement may be carried out in several stages for a given attribute, going from the root of the hierarchy towards less abstract values. We assume that for some classified data a decision may be taken based on (less or more) abstract attribute values. Without levels of abstraction, precise attribute values would have to be acquired in such a case.


Assuming that the cost of obtaining an abstract value is less than the cost of obtaining a precise value that is a refinement of this abstract value, we see that introducing levels of abstraction should allow test costs to be reduced further. Moreover, the exploration of these levels of abstraction in the context of a classified example should result in lower test costs than their exploitation during learning, or even earlier, during data preprocessing. Unfortunately, the approaches proposed so far that take into account levels of abstraction or more general ontologies aim only at obtaining models that are simpler while their classification accuracy is preserved or improved; test costs are not considered (e.g., [6–9]).

In this paper we present a novel approach to the problem of classification with test costs. Our approach sequentially explores levels of abstraction during classification – in each iteration it selects and conducts a test that makes the representation of a classified example more precise (i.e., acquires an attribute value), invokes a naïve Bayes classifier for this new representation and checks the classifier's outcome to decide whether this iterative process can be stopped. The selection of the test in each iteration takes into account the possible improvement of the prediction accuracy and the cost of this test. We show that the prediction accuracy obtained for classified examples represented precisely (i.e., when all the tests have been conducted and all specific attribute values have been acquired) can be achieved for a much smaller number of tests (i.e., when not all specific attribute values have been acquired). Moreover, we show that without levels of abstraction and with uniform test costs our method can be used for selecting features and is competitive with popular feature selection schemes: filter and wrapper.

The novelty of the paper is twofold. First, the stopping criterion of this sequential process explores the classifier outcomes for the current and previous stages of the process. Second, our approach allows for representing attribute values at different levels of abstraction in order to model domain background knowledge. These two elements allow us to achieve the aim of our research. The method presented in the paper is based on the results of our earlier works. In [3] we showed that missing values of attributes with a small value of information gain do not reduce prediction accuracy. In [10] we showed that the prediction accuracy of the sequential classification process applied in this paper converges very quickly to the prediction accuracy achieved for examples represented precisely. The method presented in this paper adds the stopping criterion to this sequential classification and presents the experimental evaluation of the proposed approach.

The paper is organized as follows. Section 2 recalls the existing approaches to the problem of classification with test costs. Section 3 presents the idea of representing background knowledge (attribute values and tests) by levels of abstraction. It also describes a naïve Bayes classifier generalized to these levels of abstraction. Section 4 describes the concept of sequential classification and the stopping criterion for this strategy. Section 5 shows the results of the experimental evaluation of the proposed method. Section 6 concludes the paper.

2. Related works

A detailed review of algorithms that take into account test costs and/or misclassification costs is presented in [11,12].
However, not all the algorithms consider the aforementioned tradeoff. Thus, we indicate the approaches where this tradeoff is considered. The problem of the tradeoff between the cost of tests and the accuracy of a classifier was considered in [13] (IDX), [14] (EG2), [15] (CS-ID3) and [16] (Clarify). All these approaches combine information gain and test costs in order to construct decision trees. The problem of the tradeoff between the cost of tests and the misclassification cost of a classifier has also been extensively analyzed. In [4] a system called ICET was presented, which uses a genetic algorithm to build a decision tree minimizing the cost of tests and misclassifications. In [17] the theoretical aspects of active learning with test costs were studied using a PAC learning framework. It is a theoretical work on a dynamic programming algorithm searching for the best diagnostic policies measuring at most a constant number of attributes. The obtained result is not applicable in practice, because it requires a predefined number of training data in order to obtain suboptimal policies. In [18] the classification process was formulated as a Markov Decision Process (MDP) whose optimal policy gives the optimal diagnostic procedure. While related to other work, this approach may incur a very high computational cost to conduct the search. In [19] a tree-building strategy was proposed that uses the minimum cost of tests and misclassifications as the attribute split criterion. In [20] a naïve Bayesian-based cost-sensitive learning algorithm, called csNB, was proposed in order to minimize the sum of test costs and misclassification costs. In [21] three tree-building strategies were proposed: a sequential test strategy, a single batch strategy and a multiple batch strategy. The comparison of these strategies showed that the total cost of the sequential test strategy is the lowest. In [22] a framework based on game theory was employed in order to build a cost-sensitive decision tree. The empirical evaluation of the proposed algorithm showed that it is possible to induce decision trees that maintain prediction accuracy but also minimize test and misclassification costs. However, there are a number of parameters which can be set in order to change the behavior of the algorithm in response to differing test costs and misclassification costs. In [23–26] the problem of cost-sensitive classification with multiple cost scales was considered. An empirical comparison of cost-sensitive decision tree induction algorithms was presented in [27]. This comparison took into account 30 algorithms, which can be organized into 10 categories. The lowest cost was produced by the ICET system. It was indicated that high accuracy rates do not always mean low classification costs. Moreover, having an inexpensive decision tree does not automatically mean that it is an accurate decision tree.

The problem of reducing test costs while maintaining the prediction accuracy of a classifier is also considered in the context of cost-based feature selection. Methods that can deal with large-scale and real-time applications are urgently needed, since costs must be budgeted and accounted for [1]. In [28] a genetic algorithm was used to perform feature selection, where the fitness function combined two criteria: the accuracy of the classification realized by a neural network and the cost of performing the classification. In [29] a similar approach was presented, where a genetic algorithm is used for feature selection and parameter optimization for a support vector machine. The fitness function aggregated the classification accuracy, the number of selected features and the feature cost. However, the above-mentioned methods have the disadvantage of being computationally expensive. Therefore, a modification of the filter model, which is known to have a low computational cost, was proposed in [30]. The presented modification adds to the feature evaluation function a term that takes into account the cost of the features.
In [31] two main components of test-time CPU cost were examined (i.e., classifier evaluation costs and feature extraction costs) and it was shown how to balance these costs with classification accuracy.

3. Representing background knowledge by levels of abstraction

Let us notice that there are areas where attribute values are represented at different levels of abstraction. These levels model domain background knowledge and usually have the form of a tree-like hierarchy [6]. In such a tree the root represents a missing value, leaves represent specific attribute values, and the remaining nodes represent abstract attribute values (e.g., sets of specific values). Importantly, such hierarchies entail the existence of tests that replace a more abstract value by a less abstract value.


Fig. 1. Example of an attribute value ontology. (The hierarchy of infectious agents: the root Infectious Agent is refined by test t1 into Bacteria, Fungi and Virus; Bacteria is refined by test t2 into Gram-positive Bacteria and Gram-negative Bacteria; Gram-positive Bacteria is refined by test t3 into Streptococcus; Gram-negative Bacteria is refined by test t4 into E.Coli and Salmonella.)

Fig. 2. Example of a trivial attribute value ontology. (The root Infectious Agent is refined by the single test t0 directly into the specific values Streptococcus, E.Coli, Salmonella, Fungi and Virus.)

Moreover, this replacement may be carried out in several stages for a given attribute, going from the root of the hierarchy towards less abstract values. We assume that for some classified data a decision may be taken based on (less or more) abstract attribute values. Without levels of abstraction, precise attribute values would have to be acquired in such a case. Assuming that the cost of obtaining an abstract value is less than the cost of obtaining a precise value that is a refinement of this abstract value, we see that introducing levels of abstraction should allow test costs to be reduced. In order to formally represent attribute values at different levels of abstraction we introduce an attribute value ontology (AVO) [32].

3.1. Attribute value ontology

Definition 1. Given is an attribute A and a set V = {v_1, v_2, ..., v_n}, n > 1, of specific values of this attribute. An attribute value ontology is a pair A = (C_A, R), where C_A is a set of concepts and R is a subsumption relation over C_A. The subset C_P ⊆ C_A of concepts without subconcepts is a finite set of atomic concepts of A. Atomic concepts C_P represent the specific values of A. Abstract concepts C_A \ C_P represent more general (imprecise) values of A.

In general, an AVO can be a directed acyclic graph. For the simplicity of the presentation, in this paper we restrict the above definition by assuming that each concept has at most one direct superconcept (a parent) and that the direct subconcepts (children) of each concept are mutually exclusive. Such an AVO has a tree structure. The root of this tree is interpreted as a missing attribute value.

Example 1. Let us consider the following medical problem. In order to determine the correct treatment, the agent that caused an infection needs to be identified. A hierarchy describing the domain of infectious agents is presented in Fig. 1. Although all viral infections determine the same treatment (like infections caused by fungi), identification of the bacteria type is important in order to decide about the appropriate treatment. Thus, the specific values of this attribute are the following: Streptococcus, E.Coli, Salmonella, Fungi, Virus. Atomic concepts from the hierarchy represent these specific values; the abstract concepts are the following: Infectious Agent, Bacteria, Gram-positive Bacteria, Gram-negative Bacteria. The concept Infectious Agent is interpreted as a missing attribute value. Let us explain that an AVO can be built even when the set of abstract concepts is restricted to the missing value. In such a case we have a trivial hierarchy – a tree of height equal to 1. In the context of our medical problem, such a trivial hierarchy is presented in Fig. 2.
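To make the tree structure of an AVO concrete, the following minimal Python sketch encodes the infectious-agent hierarchy from Example 1; the class and helper names are illustrative assumptions, not part of the authors' implementation.

class AVO:
    def __init__(self, root, children):
        self.root = root          # root concept, interpreted as a missing value
        self.children = children  # dict: concept -> list of direct subconcepts

    def descendants(self, concept):
        # All concepts strictly below `concept` in the hierarchy.
        result = []
        for child in self.children.get(concept, []):
            result.append(child)
            result.extend(self.descendants(child))
        return result

    def atomic_concepts(self):
        # Leaves of the tree, i.e. the specific attribute values.
        return [c for c in self.descendants(self.root) if not self.children.get(c)]

# The infectious-agent hierarchy from Fig. 1 (Example 1).
infectious_agent = AVO(
    root="Infectious Agent",
    children={
        "Infectious Agent": ["Bacteria", "Fungi", "Virus"],
        "Bacteria": ["Gram-positive Bacteria", "Gram-negative Bacteria"],
        "Gram-positive Bacteria": ["Streptococcus"],
        "Gram-negative Bacteria": ["E.Coli", "Salmonella"],
    },
)

print(infectious_agent.atomic_concepts())
# -> ['Streptococcus', 'E.Coli', 'Salmonella', 'Fungi', 'Virus']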

3.2. Representation of tests

The existence of an AVO entails the existence of tests that replace an abstract concept by a less abstract concept (a descendant of this abstract concept). These tests may be modeled as a function t: C_A \ C_P → C_A. However, in the paper we assume that each test replaces an abstract concept by one of its direct subconcepts (a child of this abstract concept). Following this assumption, in order to replace an abstract concept by a more precise concept (atomic or abstract), the number of required tests is equal to the length of the path between these two concepts. In the context of our medical example, we have the following tests: (t1) Infectious Agent is Bacteria or Fungi or Virus, (t2) Bacteria is Gram-positive or Gram-negative, (t3) Gram-positive Bacteria is Streptococcus or not, (t4) Gram-negative Bacteria is E.Coli or Salmonella. Therefore, in order to reveal that the Infectious Agent is Streptococcus, we have to carry out three tests: t1, t2 and t3. Considering the trivial AVO, in the context of our medical example, there would be only one test: (t0) Infectious Agent is Streptococcus or E.Coli or Salmonella or Fungi or Virus.

In the real world each test has its own cost in terms of time, money, impact on patients' health or other units. Moreover, in most practical problems, tests at lower levels of an AVO are more elaborate and their cost is higher than the cost of tests at higher levels of the AVO (e.g., in the context of our medical example, cost(t1) < cost(t2) < cost(t3)). The cost of reaching a specific value in the setting with AVO may therefore be higher than the cost in the setting with the trivial AVO (e.g., cost(t1) + cost(t2) + cost(t3) > cost(t0)).


However, we assume that for some classified data a decision may be taken based on a (less or more) abstract attribute value in the AVO instead of on a precise value (a refinement of this abstract value) in the trivial AVO. Assuming that the cost of obtaining an abstract value in the AVO is less than the cost of obtaining the precise value in the trivial AVO, we see that introducing AVO should allow test costs to be reduced. In the context of our medical example, assuming that the specific value is Streptococcus, that the decision may be taken based on the abstract value Gram-positive Bacteria in the AVO, and that cost(t1) + cost(t2) < cost(t0), we may reduce test costs. For the simplicity of the presentation, we assume in the paper that each test has the same cost (e.g., in terms of expenses). We are going to consider different costs in our future work.
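Building on the AVO sketch above, the following hedged example illustrates how the cost of revealing a value adds up along the root-to-concept path discussed in this subsection; the concrete cost values and function names are hypothetical.

def parent_map(avo):
    # Invert the children mapping: concept -> its direct superconcept.
    return {child: parent for parent, kids in avo.children.items() for child in kids}

def path_cost(avo, target, test_cost):
    # Sum the costs of the tests needed to refine the root down to `target`;
    # `test_cost` maps the concept being refined to the cost of its test.
    parents = parent_map(avo)
    cost, node = 0.0, target
    while node != avo.root:
        node = parents[node]
        cost += test_cost[node]  # one test per level, as assumed in Section 3.2
    return cost

# Hypothetical costs: t1 refines Infectious Agent, t2 Bacteria, t3/t4 the bacteria subtypes.
costs = {"Infectious Agent": 1.0, "Bacteria": 2.0,
         "Gram-positive Bacteria": 4.0, "Gram-negative Bacteria": 4.0}

print(path_cost(infectious_agent, "Streptococcus", costs))            # 7.0 = t1 + t2 + t3
print(path_cost(infectious_agent, "Gram-positive Bacteria", costs))   # 3.0 = t1 + t2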

3.3. Ontological Bayes classifier

In [32] we showed how to extend the naïve Bayes classifier to handle AVO. We call this classifier an ontological Bayes classifier (OBC). Below the idea of this extension is presented. First, we recall the definition of the naïve Bayes classifier. We assume that given is a set of m attributes A_1, A_2, ..., A_m, m > 0. An example is represented by a vector (v_1, v_2, ..., v_m), where v_i is the specific value of A_i. Let C represent the class variable and C_j (j = 1, ..., k, k > 1) represent its possible values (class labels). The naïve Bayes classifier assumes that the attributes are conditionally independent given the class variable, which gives us:

P(C_j \mid v_1, v_2, \ldots, v_m) \propto P(C_j) \prod_i P(v_i \mid C_j)    (1)

where P(v_i | C_j) is the probability of an example from class C_j having the observed value of attribute A_i equal to v_i. The probabilities in the above formula may be estimated from training examples, e.g. using relative frequency:

P(C_j) = \frac{n_{C_j}}{N}, \qquad P(v_i \mid C_j) = \frac{n_{v_i, C_j}}{n_{C_j}}    (2)

where N is the number of training examples, n_{C_j} is the number of training examples with class label C_j, and n_{v_i, C_j} is the number of training examples with the value v_i of attribute A_i and class label C_j. In order to make the estimates P(v_i | C_j) robust with respect to infrequent data, it is common to use Laplace estimates:

P(v_i \mid C_j) = \frac{n_{v_i, C_j} + 1}{n_{C_j} + V_i}    (3)

where V_i is the total number of values of attribute A_i. Another approach to estimating P(v_i | C_j) is m-estimation [33]:

P(v_i \mid C_j) = \frac{n_{v_i, C_j} + p_a m}{n_{C_j} + m}    (4)

where p_a is the a priori probability of C_j and m is a parameter of the estimation method.

After augmenting the representational language by AVO, the naïve Bayes classifier needs to be extended to estimate P(c_i | C_j), where c_i is an atomic or abstract concept from A_i. These probabilities may again be estimated from training examples, e.g. using relative frequency. The proposed extension counts training examples with the value c_i or more precise values in A_i and class label C_j. Let us recall that for a given concept c_i from A_i, all the concepts that are more precise than c_i are the descendants of c_i. In order to estimate P(c_i | C_j) by relative frequency, we use the following property:

P(c_i \mid C_j) = \frac{n_{c_i, C_j} + \sum_{c_i^k \in \mathrm{desc}(c_i, A_i)} n_{c_i^k, C_j}}{n_{C_j}}    (5)

where n_{c_i, C_j} and n_{c_i^k, C_j} are the numbers of training examples characterized by the concepts c_i or c_i^k, respectively, and class label C_j, and desc(c_i, A_i) is the set of concepts that are descendants of the concept c_i in A_i. If we assume that c_i^k is an atomic refinement of c_i (i.e., c_i^k is an atomic concept and a descendant of c_i), then we can draw an analogy between formula (5) and the value weights introduced in [34]. Specifically,

P(c_i \mid C_j) = P(c_i^k \mid C_j)^{w_k^{ij}}    (6)

where w_k^{ij} depends on the distance between c_i and the root of A_i. If c_i constitutes the root (i.e., it represents a missing value), then w_k^{ij} is equal to 0.0, and if c_i represents an atomic concept, then w_k^{ij} is equal to 1.0. Otherwise, w_k^{ij} belongs to the interval (0.0, 1.0).
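As an illustration of the relative-frequency estimate in Eq. (5), the sketch below counts, for a given class, the training examples covered by a concept or any of its descendants; it reuses the AVO sketch from Section 3.1, and the data layout and function names are assumptions made only for this example.

def p_concept_given_class(concept, class_label, examples, attr_index, avo):
    # Eq. (5): relative frequency of examples of `class_label` whose value of the
    # attribute is `concept` or any of its descendants in the AVO.
    in_class = [values for values, label in examples if label == class_label]
    if not in_class:
        return 0.0
    covered = {concept} | set(avo.descendants(concept))
    hits = sum(1 for values in in_class if values[attr_index] in covered)
    return hits / len(in_class)

# Toy training data over a single attribute (the infectious agent).
train = [(["E.Coli"], "treatment-A"), (["Salmonella"], "treatment-A"),
         (["Streptococcus"], "treatment-B"), (["Virus"], "treatment-B")]

# An abstract concept covers all of its refinements ...
print(p_concept_given_class("Gram-negative Bacteria", "treatment-A", train, 0, infectious_agent))  # 1.0
# ... and the root (missing value) covers everything, matching w = 0.0 in Eq. (6).
print(p_concept_given_class("Infectious Agent", "treatment-B", train, 0, infectious_agent))        # 1.0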

4. Sequential classification

When tests are cheap it may be rational to conduct all tests (i.e., to determine the values of all attributes). In this kind of situation, it is convenient to separate the selection of tests from the process of making a classification. When tests are expensive, interleaving the selection of tests and making a classification should give the lowest sum of costs. The outcome of a test gives us information that we can use in order to select the next test, or to decide that the cost of further tests is not justified and stop testing [4]. Such an approach is very often observed in the real world. Let us imagine a scenario in which a patient with some worrying signs and symptoms should be diagnosed and these symptoms are not sufficient to make an accurate diagnosis. Hence, some tests should be conducted in order to obtain a more precise description of the patient's state. At each step of this sequential process, the results from previous steps determine whether a further test is required to gather more information. Moreover, the selection of the next test should take into account the cost of this test and the possible improvement of prediction accuracy (or the reduction of misclassification costs).

The proposed method sequentially performs the following tasks: it selects and conducts a test that makes the representation of a classified example more precise, calculates a new outcome of OBC for this new representation and verifies whether this iterative process can be stopped. Based on [3,10] we argue that this sequential classification, with a properly constructed stopping criterion, should allow test costs to be reduced while maintaining prediction accuracy. The pseudocode of this sequential ontological Bayes classifier (sOBC) is presented as Algorithm 1. The test selection and the stopping criterion are defined in the next subsections.

The sOBC starts with an initial representation of a classified example E. This initial representation is marked as E_0 (lines 1 and 2) and the outcome of OBC for E_0 is calculated for all C_j (line 3). While the current representation of the classified example E_i contains abstract values (line 4), there is a possibility to select a test t_i (lines 5 and 6) and refine the representation precision of the classified example (line 7). In the paper we assume that each test replaces an abstract concept by one of its direct subconcepts (see Section 3.2). For this new refined representation the outcome of OBC is calculated for all C_j (line 8). If the stopping criterion is activated, then sOBC stops and returns the last outcome of OBC (line 11); otherwise sOBC tries to further refine the representation of E_i (line 4). If the current representation of the classified example E_i does not contain abstract values (line 4), sOBC also stops and returns the last outcome of OBC (line 11). Let us notice that for a classified example whose initial representation does not contain abstract values, sOBC calculates the outcome of OBC (line 3) and stops (line 11).


Algorithm 1. Sequential ontological Bayes classifier (sOBC)
Input: a classified example E
Output: a classification result of E after conducting i tests: P(C_j | E_i)
1.  i = 0
2.  E_0 = E
3.  calculate the outcome of OBC: P(C_j | E_0)
4.  while the representation of E_i contains abstract values do
5.      i = i + 1
6.      select the test t_i
7.      E_i = t_i(E_{i-1})
8.      calculate the outcome of OBC: P(C_j | E_i)
9.      if the stopping criterion is activated then break
10. end while
11. return P(C_j | E_i)
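The following Python rendering of Algorithm 1 is a minimal sketch: the OBC, the test selection and the stopping criterion are passed in as callables, and the function names are illustrative rather than the authors' implementation.

def sobc(example, has_abstract_values, select_test, obc_posterior, should_stop):
    # Sequentially refine `example` and classify it with OBC after each test;
    # the posterior is a dict mapping class labels to probabilities.
    previous = obc_posterior(example)              # lines 1-3: outcome for E_0
    while has_abstract_values(example):            # line 4
        test = select_test(example)                # lines 5-6
        example = test(example)                    # line 7: E_i = t_i(E_{i-1})
        current = obc_posterior(example)           # line 8
        if should_stop(previous, current):         # line 9: stopping criterion
            return current                         # line 11 (early stop)
        previous = current
    return previous                                # line 11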


4.1. Test selection

Let us assume that for the current step of a sequential classification a classified example E is represented by a vector (c_1, c_2, ..., c_n), where c_i is a concept from A_i (i.e., a concept of the AVO associated with attribute A_i). In this paper we assume that for the selected A_i the representation precision of E is improved by indicating a concept c_i^k that is a child of c_i in A_i. Therefore, the selection of a test reduces to the selection of the A_i for which we can make a concept more precise by going down one level in the hierarchy. We propose to apply the information gain measure to select A_i at each step of the sequential classification. Let S_i be the set of training examples that are represented using the concept c_i of A_i or the descendants of c_i. We define a measure M as follows:

M(E, A_i) = \mathrm{Entropy}(S_i) - \sum_{c_i^k \in \mathrm{children}(c_i, A_i)} \frac{|S_i^k|}{|S_i|} \mathrm{Entropy}(S_i^k)    (7)

where children(c_i, A_i) is the set of concepts that are children of the concept c_i in A_i, and S_i^k is the subset of S_i of training examples that are represented using the concept c_i^k or its descendants. The sequential classification considered in the paper selects at each step the A_i with the highest value of the measure M. Recall that the assumption of the paper is that each test has the same cost. For the setting with different test costs, a new measure should be applied. For example, a measure that takes into account both the information gain and test costs was proposed in [13]:

M(E, A_i, t_i) = M(E, A_i) / \mathrm{cost}(t_i)    (8)

where cost(t_i) is the cost of the test t_i that makes a concept c_i in A_i more precise.

4.2. Stopping criterion

The second key element of the sequential classification is the stopping criterion. Without such a criterion sOBC would stop only after reaching an atomic concept (a specific attribute value) for each attribute. Consequently, all tests would be carried out. Let us recall that our goal is to reduce the cost of the required tests. Let us notice that we have access to the classification results for the current and previous steps of sOBC for a given classified example. We assume that the outcomes of OBC (probabilities) are normalized. The sequence of these results can be explored in order to define the stopping criterion. For each test case we have explored subsequent levels of representation precision, starting from the most general representation (values of all attributes are unknown) to the most precise representation (specific values of all attributes are known). For each level of precision we have checked the outcome of OBC (the probability of the most probable class). We have noticed that during this sequential classification of a test case the most probable class changes only a few times. Moreover, a change of the most probable class is very often associated with a decrease of the probability of the most probable class. On the other hand, subsequent outcomes pointing at the same most probable class are usually associated with increasing values of the probability. To account for this observation, we have introduced a stopping criterion that considers the most probable classes and the associated probabilities in two subsequent steps. The proposed criterion stops the sequential classification process after obtaining two subsequent OBC outcomes C_{i-1}, C_i of the same class label:

C_{i-1} = \arg\max_j P(C_j \mid E_{i-1}), \qquad C_i = \arg\max_j P(C_j \mid E_i), \qquad C_{i-1} = C_i    (9)

with the probability greater than a threshold value tv:

P(C_{i-1} \mid E_{i-1}) > tv, \qquad P(C_i \mid E_i) > tv    (10)
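A hedged sketch of the two hooks used by sOBC – the information gain measure M from Eq. (7) and the stopping criterion from Eqs. (9) and (10) – is given below; the data layout is an assumption, and the default threshold 0.95 merely mirrors the fixed value suggested in Section 5.

from math import log2
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def measure_m(covered_labels, labels_by_child):
    # Eq. (7): information gain of refining a concept c_i by one level.
    # covered_labels: class labels of the training examples covered by c_i (the set S_i);
    # labels_by_child: child concept -> labels of the examples it covers (the sets S_i^k).
    gain = entropy(covered_labels)
    for child_labels in labels_by_child.values():
        if child_labels:
            gain -= (len(child_labels) / len(covered_labels)) * entropy(child_labels)
    return gain

def should_stop(previous, current, tv=0.95):
    # Eqs. (9)-(10): stop when two subsequent outcomes agree on the most probable
    # class and both probabilities exceed the threshold tv.
    prev_class = max(previous, key=previous.get)
    curr_class = max(current, key=current.get)
    return (prev_class == curr_class
            and previous[prev_class] > tv
            and current[curr_class] > tv)

A function of the form of should_stop can be plugged directly into the sobc sketch given after Algorithm 1, with the test selection built on measure_m.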

The introduction of the outcome confirmation was necessary in order to avoid stopping on a single outcome whose probability exceeds the threshold value but whose class label is wrong. We expect that the smaller the value of the threshold, the earlier the stopping criterion is activated and the greater the reduction of the number of performed tests. However, an excessive reduction of tests may lead to a significant deterioration of prediction accuracy. Therefore, the threshold value should be set experimentally (see Section 5 for details).

5. Evaluation of the proposed method

The experiment was aimed at examining the prediction accuracy and the reduction of test costs for different thresholds. Moreover, we have compared our approach with two popular feature selection schemes: filter and wrapper.

5.1. Experimental design

We used 8 of the 37 data sets considered by Zhang et al. in their studies on applying attribute value taxonomies (AVT) in the naïve Bayes classifier [6]. This selection resulted from limitations of our computational platform, which does not handle numerical attributes. We also employed the AVTs developed by Zhang and colleagues for these data sets. Three of the selected data sets (Mushroom, Soybean and Nursery) have attribute value taxonomies (AVTs) supplied by domain experts. For the remaining five data sets (Audiology, Breast Cancer, Car Evaluation, Dermatology, Zoo) we have considered AVTs generated by Zhang et al. using their AVT-Learner tool. AVT is the simplest form of AVO and is also considered in this paper. Before running the experiment, we preprocessed the data sets by removing decision classes with fewer than 10 examples (such heavily underrepresented classes may have affected the evaluated performance). Table 1 lists the data sets and their final characteristics – the data sets that were affected during preprocessing are marked with '∗' and their original characteristics are given in brackets. In this table we also indicate the class imbalance in the considered data sets by reporting the percentage of examples in the most frequent and least frequent classes. Finally, for each data set we list the number of non-trivial AVOs. Note that all data sets, except the Car Evaluation data set, are associated with a mixture of non-trivial and trivial AVOs.

The experimental design relied on the stratified 10-fold cross validation, which was repeated 10 times.

Table 1
Benchmark data sets used in the experiments.

Data set         Examples        Classes   Imbalance: max–min [%]    Attributes   AVOs
Audiology (∗)    169 (226)       5 (24)    33.7–11.8 (25.2–0.4)      69           8
Breast cancer    286             2         70.3–29.7                 9            6
Car evaluation   1728            4         70.0–3.8                  6            6
Dermatology      366             6         30.6–5.5                  34           33
Mushroom         8124            2         51.8–48.2                 22           17
Nursery (∗)      12958 (12960)   4 (5)     33.3–2.5 (33.3–0.0002)    8            6
Soybean (∗)      675 (683)       18 (19)   13.6–2.1 (13.5–1.2)       35           19
Zoo (∗)          84 (101)        4 (7)     48.8–11.9 (40.6–4.0)      16           1

Table 2
Maximum degree of test costs reduction without the deterioration of prediction accuracy.

Data set      tv∗    Gain [%]   sOBC^AVO (tv=tv∗) [%]   sOBC^AVO (tv=1) [%]
Audiology     0.99   76.2       89.9                    88.9
Breast        0.6    90.1       74.2                    72.5
Car           0.6    48.4       86.0                    86.1
Dermatology   0.99   75.9       96.4                    97.9
Mushroom      0.99   89.4       98.6                    99.7
Nursery       0.75   57.0       89.9                    90.1
Soybean       0.95   64.4       94.2                    94.3
Zoo           0.99   78.1       100.0                   100.0

Learning sets used to develop sOBC included examples represented precisely (i.e., described by specific values of all attributes). In contrast, testing sets initially included examples represented by missing values only (i.e., a root value for each AVO). Their representation precision was sequentially increased until the stopping criterion was activated or the classified examples were represented precisely for each attribute. We have chosen m-estimation of probabilities (4), because this estimation performed better than the Laplace estimation (3) in our experiments [3]. The parameters of m-estimation were the following: p_a = n_{C_j}/N, m = 0.0001, where n_{C_j} is the number of training examples with class label C_j and N is the number of training examples.

The experimental design involved two phases. The first phase was focused on examining the behaviour of our method applied to the data sets with AVOs (sOBC^AVO) for different threshold values, in order to find the best threshold values that reduce test costs while maintaining the classification accuracy. In our experimental study we assumed that all tests had the same cost, thus our method could be considered as a specific approach to feature selection in the setting with trivial AVO. Therefore, in the second phase we compared our approach (sOBC) with two popular feature selection schemes: filter and wrapper.

5.2. Experimental results

The experimental results of the first phase are presented in Fig. 3. The threshold values – the parameter of the stopping criterion – were set as follows: 1, 0.99, 0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60. For these thresholds we present the prediction accuracy and the reduction degree of test costs (Gain). These measures are defined as follows:

\mathrm{Accuracy} = \frac{n}{N} \cdot 100, \qquad \mathrm{Gain} = \left(1 - \frac{T_c}{T}\right) \cdot 100    (11)

where n is the number of correctly classified test examples, N is the number of test examples, T_c is the cost of tests conducted with the stopping criterion, and T is the cost of tests conducted without activating the stopping criterion, T_c ≤ T. These measures are presented on different scales (70–100% for Accuracy on the left axis, 0–100% for Gain on the right axis). We have decided to use different scales and axes in order to give a more detailed insight into the changes of the Accuracy measure. Let us indicate that for the threshold value equal to 1 the stopping criterion is not activated and the sequential classification is conducted as long as the representation of a classified example contains abstract values. Gain in such a case is equal to 0 (T_c = T). The prediction accuracy for this threshold is determined for precisely represented classified examples and is used as a reference point for the prediction accuracies obtained for smaller thresholds, when the stopping criterion breaks the sequential classification process.

The results presented in Fig. 3 confirm our hypothesis that it is possible to reduce the cost of tests while maintaining prediction accuracy. The general observation is the following: decreasing the threshold value increases Gain. At the same time the prediction accuracy is maintained for a certain threshold range, and then begins to decrease. Thus, we are able to indicate such a threshold value tv∗ for which we get the maximum Gain while maintaining prediction accuracy. These threshold values for each data set are given in Table 2.

Table 3
The comparison of sOBC with standard feature selection schemes.

Data set                              sOBC(tv=1)   sOBC(tv=tv∗)   sOBC(tv=0.95)   NB     NB_F   NB_W
Audiology (tv∗ = 0.99)    Acc. [%]    88.9         89.6           87.81           87.0   84.9   84.0
                          Gain [%]    0            77.1           81.60           0      88.4   88.7
Breast (tv∗ = 0.6)        Acc. [%]    72.5         73.2           72.5            71.1   73.0   72.3
                          Gain [%]    0            77.5           10.1            0      54.2   63.4
Car (tv∗ = 0.65)          Acc. [%]    86.1         86.1           86.1            85.5   85.5   85.3
                          Gain [%]    0            39.3           33.3            0      0      8.8
Dermatology (tv∗ = 0.99)  Acc. [%]    97.9         96.2           95.7            97.3   98.1   97.2
                          Gain [%]    0            65.1           68.3            0      44.4   62.6
Mushroom (tv∗ = 0.6)      Acc. [%]    99.7         99.4           99.8            95.8   98.5   99.6
                          Gain [%]    0            90.8           89.1            0      81.8   85.0
Nursery (tv∗ = 0.8)       Acc. [%]    90.1         90.1           90.1            90.2   90.1   90.3
                          Gain [%]    0            39.5           26.2            0      4.9    0
Soybean (tv∗ = 0.99)      Acc. [%]    94.3         94.5           93.0            92.2   92.0   92.8
                          Gain [%]    0            64.2           70.0            0      36.3   52.7
Zoo (tv∗ = 0.8)           Acc. [%]    100          100            100             97.6   99.6   96.8
                          Gain [%]    0            81.5           81.4            0      45.5   63.7


Fig. 3. Experimental evaluation of sOBC^AVO for different tv values – the parameter of the stopping criterion.

The prediction accuracies for these threshold values (sOBC^AVO at tv = tv∗) and for the threshold value equal to 1 (sOBC^AVO at tv = 1) are also given in Table 2. We can observe that it is possible to obtain a very significant reduction of test costs while maintaining prediction accuracy. For example, for the Breast data set the reduction is equal to 90.1%, which means that we need to conduct only 9.9% of the tests. For two data sets (Dermatology and Mushroom), even a very high threshold value equal to 0.99 causes a deterioration of prediction accuracy. However, for these data sets we can observe at the same time a significant Gain for this threshold value. Finally, let us notice that for some data sets (Audiology, Breast) we are even able to increase the prediction accuracy for some threshold values.


Let us also notice that it is possible to indicate a fixed threshold value (0.95), for which we may obtain quite a significant Gain with a rather small deterioration of the prediction accuracy for each data set.

In our experimental study we assumed that all tests had the same cost, thus our method could be considered as a specific approach to feature selection in the setting with trivial AVO. Therefore, in the second phase we compared our approach (sOBC) with two popular feature selection schemes: filter and wrapper. We have used the Weka software [35] to run these schemes. We have selected correlation-based selection (CfsSubsetEval) with greedy search (BestFirst) as the filter (NB_F) and internal cross validation (WrapperSubsetEval) with greedy search (BestFirst) as the wrapper (NB_W). Naive Bayes implemented in Weka was used as the classifier for the filtered training data and also inside the wrapper. The results of the second phase are given in Table 3. The results reveal that the wrapper works better than the filter in terms of Gain (the only exception was observed for the Nursery data set). Our method with the optimized threshold value (sOBC at tv = tv∗) offered improved Gain in comparison to the wrapper (the only exception was observed for the Audiology data set). Moreover, our method with a fixed threshold value (sOBC at tv = 0.95) also offered improved Gain in comparison to the wrapper (the only exceptions were observed for the Audiology and Breast data sets).

6. Conclusions

We have proposed a novel approach to the problem of classification with test costs, that is, the costs of obtaining attribute values of classified examples. We assume that attribute values are represented at different levels of abstraction and model domain background knowledge. The proposed method sequentially explores these levels of abstraction during classification. This allows us to reduce test costs while maintaining the prediction accuracy. In the experimental evaluation of our method on 8 UCI data sets we have been able to reduce test costs by at least 48% while maintaining the prediction accuracy. We admit that this reduction requires tuning the threshold value (considered in the stopping criterion) separately for each data set. However, we have observed that it has been possible to obtain acceptable results for the fixed threshold value of 0.95. This value can be used as a rule of thumb for new data sets. We have also considered our method in the setting without AVO and with uniform test costs. Then it boils down to feature selection similar to the incremental wrapper described in [36], although more local (limited to the context of a single classified example). An experimental comparison to the filter and wrapper schemes demonstrates that for the majority of the considered data sets our method has yielded better results (a smaller set of selected features while maintaining the prediction accuracy) than both competitive schemes. As part of future work we are going to explore our approach in classification with different test costs. Moreover, we are going to examine the effect of more complex AVOs (e.g., with a larger number of levels) on the test cost reduction.

Acknowledgments

The authors are grateful for the anonymous reviewers' insightful comments and valuable suggestions, which have substantially improved the quality of this paper.

References

[1] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, Recent advances and emerging challenges of feature selection in the context of big data, Knowl.-Based Syst. 86 (2015) 33–45.
[2] M. Saar-Tsechansky, F.J. Provost, Handling missing values when applying classification models, J. Mach. Learn. Res. 8 (2007) 1623–1657.
[3] T. Łukaszewski, J. Józefowska, A. Lawrynowicz, L. Józefowski, A. Lisiecki, Controlling the prediction accuracy by adjusting the abstraction levels, in: Proceedings of the 6th International Conference on Hybrid Artificial Intelligent Systems, 2011, pp. 288–295.
[4] P.D. Turney, Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm, J. Artif. Intell. Res. 2 (1995) 369–409.
[5] S. Zhang, Z. Qin, C.X. Ling, S. Sheng, Missing is useful: missing values in cost-sensitive decision trees, IEEE Trans. Knowl. Data Eng. 17 (12) (2005) 1689–1693.
[6] J. Zhang, D. Kang, A. Silvescu, V. Honavar, Learning accurate and concise naïve Bayes classifiers from attribute value taxonomies and data, Knowl. Inf. Syst. 9 (2) (2006) 157–179.
[7] M. Ye, X. Wu, X. Hu, D. Hu, Knowledge reduction for decision tables with attribute value taxonomies, Knowl.-Based Syst. 56 (2014) 68–78.
[8] K. Pancerz, A. Lewicki, Encoding symbolic features in simple decision systems over ontological graphs for PSO and neural network based classifiers, Neurocomputing 144 (2014) 338–345.
[9] K. Pancerz, A. Lewicki, R. Tadeusiewicz, Ant-based extraction of rules in simple decision systems over ontological graphs, Appl. Math. Comput. Sci. 25 (2) (2015) 377–387.
[10] T. Łukaszewski, S. Wilk, Sequential classification by exploring levels of abstraction, in: Proceedings of the 18th International Conference in Knowledge Based and Intelligent Information and Engineering Systems, 2014, pp. 309–317.
[11] Z. Qin, C. Zhang, T. Wang, S. Zhang, Cost sensitive classification in data mining, in: Proceedings of the 6th International Conference on Advanced Data Mining and Applications, 2010, pp. 1–11.
[12] S. Lomax, S. Vadera, A survey of cost-sensitive decision tree induction algorithms, ACM Comput. Surv. 45 (2) (2013) 16.
[13] S. Norton, Generating better decision trees, in: Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI 89, 1989, pp. 800–805.
[14] M. Núñez, The use of background knowledge in decision tree induction, Mach. Learn. 6 (3) (1991) 231–250.
[15] M. Tan, Cost-sensitive learning of classification knowledge and its applications in robotics, Mach. Learn. 13 (1) (1993) 7–33.
[16] J.V. Davis, J. Ha, C.J. Rossbach, H.E. Ramadan, E. Witchel, Cost-sensitive decision tree learning for forensic classification, in: Proceedings of the 17th European Conference on Machine Learning, 2006, pp. 622–629.
[17] R. Greiner, A.J. Grove, D. Roth, Learning cost-sensitive active classifiers, Artif. Intell. 139 (2) (2002) 137–174.
[18] V.B. Zubek, T.G. Dietterich, Pruning improves heuristic search for cost-sensitive learning, in: Proceedings of the Nineteenth International Conference on Machine Learning, ICML, 2002, pp. 19–26.
[19] C.X. Ling, Q. Yang, J. Wang, S. Zhang, Decision trees with minimal costs, in: Proceedings of the Twenty-first International Conference on Machine Learning, ICML, 2004, pp. 69–76.
[20] X. Chai, L. Deng, Q. Yang, C.X. Ling, Test-cost sensitive naive Bayes classification, in: Proceedings of the 4th IEEE International Conference on Data Mining, ICDM, 2004, pp. 51–58.
[21] S. Sheng, C.X. Ling, A. Ni, S. Zhang, Cost-sensitive test strategies, in: Proceedings of the Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, 2006, pp. 482–487.
[22] S. Lomax, S. Vadera, M. Saraee, A multi-armed bandit approach to cost-sensitive decision tree learning, in: Proceedings of the 12th IEEE International Conference on Data Mining Workshops, 2012, pp. 162–168.
[23] Z. Qin, S. Zhang, C. Zhang, Cost-sensitive decision trees with multiple cost scales, in: Proceedings of the 17th Australian Joint Conference on Artificial Intelligence, AI 2004, 2004, pp. 380–390.
[24] A. Ni, X. Zhu, C. Zhang, Any-cost discovery: learning optimal classification rules, in: Proceedings of the 18th Australian Joint Conference on Artificial Intelligence, 2005, pp. 123–132.
[25] S. Zhang, Cost-sensitive classification with respect to waiting cost, Knowl.-Based Syst. 23 (5) (2010) 369–378.
[26] S. Zhang, Decision tree classifiers sensitive to heterogeneous costs, J. Syst. Softw. 85 (4) (2012) 771–779.
[27] S. Lomax, S. Vadera, An empirical comparison of cost-sensitive decision tree induction algorithms, Expert Syst. 28 (3) (2011) 227–268.
[28] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, IEEE Intell. Syst. 13 (2) (1998) 44–49.
[29] C. Huang, C. Wang, A GA-based feature selection and parameters optimization for support vector machines, Expert Syst. Appl. 31 (2) (2006) 231–240.
[30] V. Bolón-Canedo, I. Porto-Díaz, N. Sánchez-Maroño, A. Alonso-Betanzos, A framework for cost-based feature selection, Pattern Recognit. 47 (7) (2014) 2481–2489.
[31] Z.E. Xu, M.J. Kusner, K.Q. Weinberger, M. Chen, O. Chapelle, Classifier cascades and trees for minimizing feature evaluation cost, J. Mach. Learn. Res. 15 (1) (2014) 2113–2144.
[32] T. Łukaszewski, J. Józefowska, A. Ławrynowicz, L. Józefowski, Handling the description noise using an attribute value ontology, Control Cybern. 40 (2011) 275–292.
[33] B. Cestnik, I. Bratko, On estimating probabilities in tree pruning, in: Proceedings of the European Working Session on Learning, EWSL, 1991, pp. 138–150.
[34] C. Lee, A gradient approach for value weighted classification learning in naive Bayes, Knowl.-Based Syst. 85 (2015) 71–79.
[35] R.R. Bouckaert, E. Frank, M.A. Hall, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, WEKA – experiences with a Java open-source project, J. Mach. Learn. Res. 11 (2010) 2533–2541.
[36] P. Bermejo, J.A. Gámez, J.M. Puerta, Speeding up incremental wrapper feature subset selection with naive Bayes classifier, Knowl.-Based Syst. 55 (2014) 140–147.