Classification with test costs and background knowledge

Tomasz Łukaszewski∗, Szymon Wilk
Computer Science Department, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
∗ Corresponding author. Tel.: +48 616652920. E-mail address: [email protected] (T. Łukaszewski).

Article history: Received 30 April 2015; Revised 30 September 2015; Accepted 7 October 2015; Available online 22 October 2015

Keywords: Test costs; Levels of abstraction; Naïve Bayes classifier

Abstract

We propose a novel approach to the problem of classification with test costs, understood as the costs of obtaining attribute values of classified examples. Many existing approaches construct classifiers in order to control the tradeoff between test costs and the prediction accuracy (or misclassification costs). The aim of the proposed method is to reduce test costs while maintaining the prediction accuracy of a classifier. We assume that attribute values are represented at different levels of abstraction and model domain background knowledge. Our approach sequentially explores these levels during classification – in each iteration it selects and conducts a test that makes the representation of a classified example more precise (i.e., acquires an attribute value), invokes a naïve Bayes classifier for this new representation and checks the classifier's outcome to decide whether this iterative process can be stopped. The selection of the test in each iteration takes into account the possible improvement of the prediction accuracy and the cost of this test. We show that the prediction accuracy obtained for classified examples represented precisely (i.e., when all the tests have been conducted and all specific attribute values have been acquired) can be achieved for a much smaller number of tests (i.e., when not all specific attribute values have been acquired). Moreover, we show that without levels of abstraction and with uniform test costs our method can be used for selecting features and is competitive with popular feature selection schemes: filter and wrapper.

1. Introduction

One of the main tasks of machine learning is to build classifiers from available data. Constructed classifiers, after their evaluation, are applied in many real-world applications: in medical diagnosis, automated testing, robotics, industrial production processes and many other areas. The most commonly used evaluation criterion of a classifier is its prediction accuracy. The measure of the prediction accuracy of a classifier is often replaced by the measure of the misclassification costs of a classifier, because different errors may have different costs. On the other hand, more and more attention is paid to test costs, that is, the costs of obtaining attribute values (features) of classified examples. The cost associated with a feature can be related to different concepts: expenses, risks or computational costs [1]. In order to decrease the total cost of these tests we may reduce their number, allowing for missing attribute values in the representation of classified examples. However, missing values of relevant attributes in the representation of classified examples usually degrade the prediction accuracy of a classifier (or increase the misclassification costs of a classifier) [2,3].




Therefore, we have to decide which tests should be carried out in order to control the tradeoff between the cost of these tests and the accuracy of a classifier (or the tradeoff between test costs and misclassification costs of a classifier) [4,5]. However, in many real applications it is very difficult to evaluate misclassification costs. For example, in medical diagnosis, how much money should be assigned to a misclassification cost when a misclassification endangers a patient's life? In such cases, we should concentrate on the tradeoff between test costs and the accuracy of a classifier. The appropriate approach may be to reduce test costs while maintaining the prediction accuracy of a classifier. This goal may be achieved by cost-based feature selection methods [1]. Let us notice that standard feature selection methods were designed to handle plain data without any type of generalization of attribute values. However, there are areas where attribute values are represented at different levels of abstraction. These levels model domain background knowledge and usually have the form of a tree-like hierarchy. In such a tree the root represents a missing value, leaves represent specific attribute values, and the remaining nodes represent abstract attribute values (e.g., sets of specific values). Importantly, such hierarchies entail the existence of tests that replace a more abstract value by a less abstract value. Moreover, this replacement may be carried out in several stages for a given attribute, going from the root of the hierarchy towards less abstract values. We assume that for some classified data a decision may be taken based on (less or more) abstract attribute values. Without levels of abstraction, precise attribute values would have to be acquired in such a case.


Assuming that the cost of obtaining an abstract value is less than the cost of obtaining a precise value that is a refinement of this abstract value, we see that introducing levels of abstraction should allow test costs to be reduced further. Moreover, the exploration of these levels of abstraction in the context of a classified example should result in lower test costs than their exploitation during learning, or even earlier, during data preprocessing. Unfortunately, the approaches proposed so far that take into account levels of abstraction or more general ontologies aim only at obtaining models that are simpler while their classification accuracy is preserved or improved; test costs are not considered (e.g., [6–9]).

In this paper we present a novel approach to the problem of classification with test costs. Our approach sequentially explores levels of abstraction during classification – in each iteration it selects and conducts a test that makes the representation of a classified example more precise (i.e., acquires an attribute value), invokes a naïve Bayes classifier for this new representation and checks the classifier's outcome to decide whether this iterative process can be stopped. The selection of the test in each iteration takes into account the possible improvement of the prediction accuracy and the cost of this test. We show that the prediction accuracy obtained for classified examples represented precisely (i.e., when all the tests have been conducted and all specific attribute values have been acquired) can be achieved for a much smaller number of tests (i.e., when not all specific attribute values have been acquired). Moreover, we show that without levels of abstraction and with uniform test costs our method can be used for selecting features and is competitive with popular feature selection schemes: filter and wrapper.

The novelty of the paper is twofold. First, the stopping criterion of this sequential process explores the classifier outcomes for the current and previous stages of the process. Second, our approach allows for representing attribute values at different levels of abstraction in order to model domain background knowledge. These two elements allow us to achieve the aim of our research. The method presented in the paper is based on the results of our earlier works. In [3] we showed that missing values of attributes with a small value of information gain do not reduce prediction accuracy. In [10] we showed that the prediction accuracy of the sequential classification process applied in this paper converges very quickly to the prediction accuracy achieved for examples represented precisely. The method presented in this paper adds the stopping criterion to this sequential classification and presents the experimental evaluation of the proposed approach.

The paper is organized as follows. Section 2 recalls the existing approaches to the problem of classification with test costs. Section 3 presents the idea of representing background knowledge (attribute values and tests) by levels of abstraction. It also describes a naïve Bayes classifier generalized to these levels of abstraction. Section 4 describes the concept of sequential classification and the stopping criterion for this strategy. Section 5 shows the results of the experimental evaluation of the proposed method. Section 6 concludes the paper.

2. Related works

A detailed review of algorithms that take into account test costs and/or misclassification costs is presented in [11,12].
However, not all the algorithms consider the aforementioned tradeoff. Thus, we indicate the approaches where this tradeoff is considered. The problem of the tradeoff between the cost of tests and the accuracy of a classifier was considered in [13] (IDX), [14] (EG2), [15] (CS-ID3) and [16] (Clarify). All these approaches combine information gain and test costs in order to construct decision trees. The problem of the tradeoff between the cost of tests and the misclassification cost of a classifier has also been extensively analyzed. In [4] a system called ICET was presented, which uses a genetic algorithm to build a decision tree minimizing the cost of tests and misclassifications. In [17] the theoretical aspects of active learning with test costs were studied using a PAC learning framework. It is a theoretical work on a dynamic programming algorithm searching for the best diagnostic policies measuring at most a constant number of attributes. The obtained result is not applicable in practice, because it requires a predefined number of training data in order to obtain suboptimal policies. In [18] the classification process was formulated as a Markov Decision Process (MDP) whose optimal policy gives the optimal diagnostic procedure. While related to other work, this approach may incur a very high computational cost to conduct the search. In [19] a tree-building strategy was proposed that uses the minimum cost of tests and misclassifications as the attribute split criterion. In [20] a naïve Bayesian-based cost-sensitive learning algorithm, called csNB, was proposed in order to minimize the sum of test costs and misclassification costs. In [21] three tree-building strategies were proposed: a sequential test strategy, a single batch strategy and a multiple batch strategy. The comparison of these strategies showed that the total cost of the sequential test strategy is the lowest. In [22] a framework based on game theory was employed in order to build a cost-sensitive decision tree. The empirical evaluation of the proposed algorithm showed that it is possible to induce decision trees that maintain prediction accuracy but also minimize test and misclassification costs. However, there are a number of parameters which can be set in order to change the behavior of the algorithm in response to differing test costs and misclassification costs. In [23–26] the problem of cost-sensitive classification with multiple cost scales was considered. An empirical comparison of cost-sensitive decision tree induction algorithms was presented in [27]. This comparison took into account 30 algorithms, which can be organized into 10 categories. The lowest cost was produced by the ICET system. It was indicated that high accuracy rates do not always mean low classification costs. Moreover, having an inexpensive decision tree does not automatically mean that it is an accurate decision tree.

The problem of reducing test costs while maintaining the prediction accuracy of a classifier is also considered in the context of cost-based feature selection. Methods that can deal with large-scale and real-time applications are urgently needed, since costs must be budgeted and accounted for [1]. In [28] a genetic algorithm was used to perform feature selection, where the fitness function combined two criteria: the accuracy of the classification realized by a neural network and the cost of performing the classification. In [29] a similar approach was presented, where a genetic algorithm is used for feature selection and parameter optimization for a support vector machine. The fitness function aggregated the classification accuracy, the number of selected features and the feature cost. However, the above-mentioned methods have the disadvantage of being computationally expensive. Therefore, a modification of the filter model, which is known to have a low computational cost, was proposed in [30]. The presented modification adds to the feature evaluation function a term that takes into account the cost of the features.
In [31] two main components of test-time CPU cost were examined (i.e., classifier evaluation costs and feature extraction costs) and it was shown how to balance these costs with classification accuracy.

3. Representing background knowledge by levels of abstraction

Let us notice that there are areas where attribute values are represented at different levels of abstraction. These levels model domain background knowledge and usually have the form of a tree-like hierarchy [6]. In such a tree the root represents a missing value, leaves represent specific attribute values, and the remaining nodes represent abstract attribute values (e.g., sets of specific values). Importantly, such hierarchies entail the existence of tests that replace a more abstract value by a less abstract value.


Fig. 1. Example of an attribute value ontology. (The hierarchy of infectious agents: the root Infectious Agent is refined by test t1 into Bacteria, Fungi and Virus; Bacteria is refined by test t2 into Gram-positive Bacteria and Gram-negative Bacteria; Gram-positive Bacteria is refined by test t3 into Streptococcus; Gram-negative Bacteria is refined by test t4 into E.Coli and Salmonella.)

Fig. 2. Example of a trivial attribute value ontology. (The root Infectious Agent is refined by the single test t0 directly into the specific values Streptococcus, E.Coli, Salmonella, Fungi and Virus.)

Moreover, this replacement may be carried out in several stages for a given attribute, going from the root of the hierarchy towards less abstract values. We assume that for some classified data a decision may be taken based on (less or more) abstract attribute values. Without levels of abstraction, precise attribute values would have to be acquired in such a case. Assuming that the cost of obtaining an abstract value is less than the cost of obtaining a precise value that is a refinement of this abstract value, we see that introducing levels of abstraction should allow test costs to be reduced. In order to formally represent attribute values at different levels of abstraction we introduce an attribute value ontology (AVO) [32].

3.1. Attribute value ontology

Definition 1. Given is an attribute A and a set V = {v_1, v_2, ..., v_n}, n > 1, of specific values of this attribute. An attribute value ontology is a pair A = (C_A, R), where C_A is a set of concepts and R is a subsumption relation over C_A. The subset C_P ⊆ C_A of concepts without subconcepts is a finite set of atomic concepts of A. Atomic concepts C_P represent the specific values of A. Abstract concepts C_A \ C_P represent more general (imprecise) values of A.

In general, an AVO can be a directed acyclic graph. For the simplicity of the presentation, in this paper we restrict the above definition by assuming that each concept has at most one direct superconcept (a parent) and that the direct subconcepts (children) of each concept are mutually exclusive. Such an AVO has a tree structure. The root of this tree is interpreted as a missing attribute value.

Example 1. Let us consider the following medical problem. In order to determine the correct treatment, the agent that caused an infection needs to be identified. A hierarchy describing the domain of infectious agents is presented in Fig. 1. Although all viral infections determine the same treatment (like infections caused by fungi), identification of the bacteria type is important in order to decide about the appropriate treatment. Thus, the specific values of this attribute are the following: Streptococcus, E.Coli, Salmonella, Fungi, Virus. Atomic concepts from the hierarchy represent these specific values; the abstract concepts are the following: Infectious Agent, Bacteria, Gram-positive Bacteria, Gram-negative Bacteria. The concept Infectious Agent is interpreted as a missing attribute value. Let us explain that an AVO can be built even when the set of abstract concepts is restricted to the missing value. In such a case we have a trivial hierarchy – a tree of height equal to 1. In the context of our medical problem, such a trivial hierarchy is presented in Fig. 2.
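To make the tree structure of an AVO concrete, the following minimal Python sketch encodes the infectious-agent hierarchy from Example 1; the class and helper names are illustrative assumptions, not part of the authors' implementation.

class AVO:
    def __init__(self, root, children):
        self.root = root          # root concept, interpreted as a missing value
        self.children = children  # dict: concept -> list of direct subconcepts

    def descendants(self, concept):
        # All concepts strictly below `concept` in the hierarchy.
        result = []
        for child in self.children.get(concept, []):
            result.append(child)
            result.extend(self.descendants(child))
        return result

    def atomic_concepts(self):
        # Leaves of the tree, i.e. the specific attribute values.
        return [c for c in self.descendants(self.root) if not self.children.get(c)]

# The infectious-agent hierarchy from Fig. 1 (Example 1).
infectious_agent = AVO(
    root="Infectious Agent",
    children={
        "Infectious Agent": ["Bacteria", "Fungi", "Virus"],
        "Bacteria": ["Gram-positive Bacteria", "Gram-negative Bacteria"],
        "Gram-positive Bacteria": ["Streptococcus"],
        "Gram-negative Bacteria": ["E.Coli", "Salmonella"],
    },
)

print(infectious_agent.atomic_concepts())
# -> ['Streptococcus', 'E.Coli', 'Salmonella', 'Fungi', 'Virus']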

3.2. Representation of tests

The existence of an AVO entails the existence of tests that replace an abstract concept by a less abstract concept (a descendant of this abstract concept). These tests may be modeled as a function t: C_A \ C_P → C_A. However, in the paper we assume that each test replaces an abstract concept by one of its direct subconcepts (a child of this abstract concept). Following this assumption, in order to replace an abstract concept by a more precise concept (atomic or abstract), the number of required tests is equal to the length of the path between these two concepts. In the context of our medical example, we have the following tests: (t1) Infectious Agent is Bacteria or Fungi or Virus, (t2) Bacteria is Gram-positive or Gram-negative, (t3) Gram-positive Bacteria is Streptococcus or not, (t4) Gram-negative Bacteria is E.Coli or Salmonella. Therefore, in order to reveal that the Infectious Agent is Streptococcus, we have to carry out three tests: t1, t2 and t3. Considering the trivial AVO, in the context of our medical example, there would be only one test: (t0) Infectious Agent is Streptococcus or E.Coli or Salmonella or Fungi or Virus.

In the real world each test has its own cost in terms of time, money, impact on patients' health or other units. Moreover, in most practical problems, tests at lower levels of an AVO are more elaborate and their cost is higher than the cost of tests at higher levels of the AVO (e.g., in the context of our medical example, cost(t1) < cost(t2) < cost(t3)). The cost of reaching a specific value in the setting with AVO may therefore be higher than the cost in the setting with the trivial AVO (e.g., cost(t1) + cost(t2) + cost(t3) > cost(t0)).


However, we assume that for some classified data a decision may be taken based on a (less or more) abstract attribute value in the AVO instead of on a precise value (a refinement of this abstract value) in the trivial AVO. Assuming that the cost of obtaining an abstract value in the AVO is less than the cost of obtaining the precise value in the trivial AVO, we see that introducing AVO should allow test costs to be reduced. In the context of our medical example, assuming that the specific value is Streptococcus, that the decision may be taken based on the abstract value Gram-positive Bacteria in the AVO, and that cost(t1) + cost(t2) < cost(t0), we may reduce test costs. For the simplicity of the presentation, we assume in the paper that each test has the same cost (e.g., in terms of expenses). We are going to consider different costs in our future work.
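Building on the AVO sketch above, the following hedged example illustrates how the cost of revealing a value adds up along the root-to-concept path discussed in this subsection; the concrete cost values and function names are hypothetical.

def parent_map(avo):
    # Invert the children mapping: concept -> its direct superconcept.
    return {child: parent for parent, kids in avo.children.items() for child in kids}

def path_cost(avo, target, test_cost):
    # Sum the costs of the tests needed to refine the root down to `target`;
    # `test_cost` maps the concept being refined to the cost of its test.
    parents = parent_map(avo)
    cost, node = 0.0, target
    while node != avo.root:
        node = parents[node]
        cost += test_cost[node]  # one test per level, as assumed in Section 3.2
    return cost

# Hypothetical costs: t1 refines Infectious Agent, t2 Bacteria, t3/t4 the bacteria subtypes.
costs = {"Infectious Agent": 1.0, "Bacteria": 2.0,
         "Gram-positive Bacteria": 4.0, "Gram-negative Bacteria": 4.0}

print(path_cost(infectious_agent, "Streptococcus", costs))            # 7.0 = t1 + t2 + t3
print(path_cost(infectious_agent, "Gram-positive Bacteria", costs))   # 3.0 = t1 + t2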

3.3. Ontological Bayes classifier

In [32] we showed how to extend the naïve Bayes classifier to handle AVO. We call this classifier an ontological Bayes classifier (OBC). Below the idea of this extension is presented. First, we recall the definition of the naïve Bayes classifier. We assume that given is a set of m attributes A_1, A_2, ..., A_m, m > 0. An example is represented by a vector (v_1, v_2, ..., v_m), where v_i is the specific value of A_i. Let C represent the class variable and C_j (j = 1, ..., k, k > 1) represent its possible values (class labels). The naïve Bayes classifier assumes that the attributes are conditionally independent given the class variable, which gives us:

P(C_j \mid v_1, v_2, \ldots, v_m) \propto P(C_j) \prod_i P(v_i \mid C_j)    (1)

where P(v_i | C_j) is the probability of an example from class C_j having the observed value of attribute A_i equal to v_i. The probabilities in the above formula may be estimated from training examples, e.g. using relative frequency:

P(C_j) = \frac{n_{C_j}}{N}, \qquad P(v_i \mid C_j) = \frac{n_{v_i, C_j}}{n_{C_j}}    (2)

where N is the number of training examples, n_{C_j} is the number of training examples with class label C_j, and n_{v_i, C_j} is the number of training examples with the value v_i of attribute A_i and class label C_j. In order to make the estimates P(v_i | C_j) robust with respect to infrequent data, it is common to use Laplace estimates:

P(v_i \mid C_j) = \frac{n_{v_i, C_j} + 1}{n_{C_j} + V_i}    (3)

where V_i is the total number of values of attribute A_i. Another approach to estimating P(v_i | C_j) is m-estimation [33]:

P(v_i \mid C_j) = \frac{n_{v_i, C_j} + p_a m}{n_{C_j} + m}    (4)

where p_a is the a priori probability of C_j and m is a parameter of the estimation method.

After augmenting the representational language by AVO, the naïve Bayes classifier needs to be extended to estimate P(c_i | C_j), where c_i is an atomic or abstract concept from A_i. These probabilities may again be estimated from training examples, e.g. using relative frequency. The proposed extension counts training examples with the value c_i or more precise values in A_i and class label C_j. Let us recall that for a given concept c_i from A_i, all the concepts that are more precise than c_i are the descendants of c_i. In order to estimate P(c_i | C_j) by relative frequency, we use the following property:

P(c_i \mid C_j) = \frac{n_{c_i, C_j} + \sum_{c_i^k \in \mathrm{desc}(c_i, A_i)} n_{c_i^k, C_j}}{n_{C_j}}    (5)

where n_{c_i, C_j} and n_{c_i^k, C_j} are the numbers of training examples characterized by the concepts c_i or c_i^k, respectively, and class label C_j, and desc(c_i, A_i) is the set of concepts that are descendants of the concept c_i in A_i. If we assume that c_i^k is an atomic refinement of c_i (i.e., c_i^k is an atomic concept and a descendant of c_i), then we can draw an analogy between formula (5) and the value weights introduced in [34]. Specifically,

P(c_i \mid C_j) = P(c_i^k \mid C_j)^{w_k^{ij}}    (6)

where w_k^{ij} depends on the distance between c_i and the root of A_i. If c_i constitutes the root (i.e., it represents a missing value), then w_k^{ij} is equal to 0.0, and if c_i represents an atomic concept, then w_k^{ij} is equal to 1.0. Otherwise, w_k^{ij} belongs to the interval (0.0, 1.0).
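As an illustration of the relative-frequency estimate in Eq. (5), the sketch below counts, for a given class, the training examples covered by a concept or any of its descendants; it reuses the AVO sketch from Section 3.1, and the data layout and function names are assumptions made only for this example.

def p_concept_given_class(concept, class_label, examples, attr_index, avo):
    # Eq. (5): relative frequency of examples of `class_label` whose value of the
    # attribute is `concept` or any of its descendants in the AVO.
    in_class = [values for values, label in examples if label == class_label]
    if not in_class:
        return 0.0
    covered = {concept} | set(avo.descendants(concept))
    hits = sum(1 for values in in_class if values[attr_index] in covered)
    return hits / len(in_class)

# Toy training data over a single attribute (the infectious agent).
train = [(["E.Coli"], "treatment-A"), (["Salmonella"], "treatment-A"),
         (["Streptococcus"], "treatment-B"), (["Virus"], "treatment-B")]

# An abstract concept covers all of its refinements ...
print(p_concept_given_class("Gram-negative Bacteria", "treatment-A", train, 0, infectious_agent))  # 1.0
# ... and the root (missing value) covers everything, matching w = 0.0 in Eq. (6).
print(p_concept_given_class("Infectious Agent", "treatment-B", train, 0, infectious_agent))        # 1.0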

4. Sequential classification

When tests are cheap it may be rational to conduct all tests (i.e., to determine the values of all attributes). In this kind of situation, it is convenient to separate the selection of tests from the process of making a classification. When tests are expensive, interleaving the selection of tests and making a classification should give the lowest sum of costs. The outcome of a test gives us information that we can use in order to select the next test, or to decide that the cost of further tests is not justified and stop testing [4]. Such an approach is very often observed in the real world. Let us imagine a scenario in which a patient with some worrying signs and symptoms should be diagnosed and these symptoms are not sufficient to make an accurate diagnosis. Hence, some tests should be conducted in order to obtain a more precise description of the patient's state. At each step of this sequential process, the results from previous steps determine whether a further test is required to gather more information. Moreover, the selection of the next test should take into account the cost of this test and the possible improvement of prediction accuracy (or the reduction of misclassification costs).

The proposed method sequentially performs the following tasks: it selects and conducts a test that makes the representation of a classified example more precise, calculates a new outcome of OBC for this new representation and verifies whether this iterative process can be stopped. Based on [3,10] we argue that this sequential classification, with a properly constructed stopping criterion, should allow test costs to be reduced while maintaining prediction accuracy. The pseudocode of this sequential ontological Bayes classifier (sOBC) is presented as Algorithm 1. The test selection and the stopping criterion are defined in the next subsections.

The sOBC starts with an initial representation of a classified example E. This initial representation is marked as E_0 (lines 1 and 2) and the outcome of OBC for E_0 is calculated for all C_j (line 3). While the current representation of the classified example E_i contains abstract values (line 4), there is a possibility to select a test t_i (lines 5 and 6) and refine the representation precision of the classified example (line 7). In the paper we assume that each test replaces an abstract concept by one of its direct subconcepts (see Section 3.2). For this new refined representation the outcome of OBC is calculated for all C_j (line 8). If the stopping criterion is activated, then sOBC stops and returns the last outcome of OBC (line 11); otherwise sOBC tries to further refine the representation of E_i (line 4). If the current representation of the classified example E_i does not contain abstract values (line 4), sOBC also stops and returns the last outcome of OBC (line 11). Let us notice that for a classified example whose initial representation does not contain abstract values, sOBC calculates the outcome of OBC (line 3) and stops (line 11).


Algorithm 1. Sequential ontological Bayes classifier (sOBC)
Input: a classified example E
Output: a classification result of E after conducting i tests: P(C_j | E_i)
1.  i = 0
2.  E_0 = E
3.  calculate the outcome of OBC: P(C_j | E_0)
4.  while the representation of E_i contains abstract values do
5.      i = i + 1
6.      select the test t_i
7.      E_i = t_i(E_{i-1})
8.      calculate the outcome of OBC: P(C_j | E_i)
9.      if the stopping criterion is activated then break
10. end while
11. return P(C_j | E_i)
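The following Python rendering of Algorithm 1 is a minimal sketch: the OBC, the test selection and the stopping criterion are passed in as callables, and the function names are illustrative rather than the authors' implementation.

def sobc(example, has_abstract_values, select_test, obc_posterior, should_stop):
    # Sequentially refine `example` and classify it with OBC after each test;
    # the posterior is a dict mapping class labels to probabilities.
    previous = obc_posterior(example)              # lines 1-3: outcome for E_0
    while has_abstract_values(example):            # line 4
        test = select_test(example)                # lines 5-6
        example = test(example)                    # line 7: E_i = t_i(E_{i-1})
        current = obc_posterior(example)           # line 8
        if should_stop(previous, current):         # line 9: stopping criterion
            return current                         # line 11 (early stop)
        previous = current
    return previous                                # line 11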


4.1. Test selection

Let us assume that for the current step of a sequential classification a classified example E is represented by a vector (c_1, c_2, ..., c_n), where c_i is a concept from A_i (i.e., a concept of the AVO associated with attribute A_i). In this paper we assume that for the selected A_i the representation precision of E is improved by indicating a concept c_i^k that is a child of c_i in A_i. Therefore, the selection of a test reduces to the selection of the A_i for which we can make a concept more precise by going down one level in the hierarchy. We propose to apply the information gain measure to select A_i at each step of the sequential classification. Let S_i be the set of training examples that are represented using the concept c_i of A_i or the descendants of c_i. We define a measure M as follows:

M(E, A_i) = \mathrm{Entropy}(S_i) - \sum_{c_i^k \in \mathrm{children}(c_i, A_i)} \frac{|S_i^k|}{|S_i|} \mathrm{Entropy}(S_i^k)    (7)

where children(c_i, A_i) is the set of concepts that are children of the concept c_i in A_i, and S_i^k is the subset of S_i of training examples that are represented using the concept c_i^k or its descendants. The sequential classification considered in the paper selects at each step the A_i with the highest value of the measure M. Recall that the assumption of the paper is that each test has the same cost. For the setting with different test costs, a new measure should be applied. For example, a measure that takes into account both the information gain and test costs was proposed in [13]:

M(E, A_i, t_i) = M(E, A_i) / \mathrm{cost}(t_i)    (8)

where cost(t_i) is the cost of the test t_i that makes a concept c_i in A_i more precise.

4.2. Stopping criterion

The second key element of the sequential classification is the stopping criterion. Without such a criterion sOBC would stop only after reaching an atomic concept (a specific attribute value) for each attribute. Consequently, all tests would be carried out. Let us recall that our goal is to reduce the cost of the required tests. Let us notice that we have access to the classification results for the current and previous steps of sOBC for a given classified example. We assume that the outcomes of OBC (probabilities) are normalized. The sequence of these results can be explored in order to define the stopping criterion. For each test case we have explored subsequent levels of representation precision, starting from the most general representation (values of all attributes are unknown) to the most precise representation (specific values of all attributes are known). For each level of precision we have checked the outcome of OBC (the probability of the most probable class). We have noticed that during this sequential classification of a test case the most probable class changes only a few times. Moreover, a change of the most probable class is very often associated with a decrease of the probability of the most probable class. On the other hand, subsequent outcomes pointing at the same most probable class are usually associated with increasing values of the probability. To account for this observation, we have introduced a stopping criterion that considers the most probable classes and the associated probabilities in two subsequent steps. The proposed criterion stops the sequential classification process after obtaining two subsequent OBC outcomes C_{i-1}, C_i of the same class label:

C_{i-1} = \arg\max_j P(C_j \mid E_{i-1}), \qquad C_i = \arg\max_j P(C_j \mid E_i), \qquad C_{i-1} = C_i    (9)

with the probability greater than a threshold value tv:

P(C_{i-1} \mid E_{i-1}) > tv, \qquad P(C_i \mid E_i) > tv    (10)
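A hedged sketch of the two hooks used by sOBC – the information gain measure M from Eq. (7) and the stopping criterion from Eqs. (9) and (10) – is given below; the data layout is an assumption, and the default threshold 0.95 merely mirrors the fixed value suggested in Section 5.

from math import log2
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def measure_m(covered_labels, labels_by_child):
    # Eq. (7): information gain of refining a concept c_i by one level.
    # covered_labels: class labels of the training examples covered by c_i (the set S_i);
    # labels_by_child: child concept -> labels of the examples it covers (the sets S_i^k).
    gain = entropy(covered_labels)
    for child_labels in labels_by_child.values():
        if child_labels:
            gain -= (len(child_labels) / len(covered_labels)) * entropy(child_labels)
    return gain

def should_stop(previous, current, tv=0.95):
    # Eqs. (9)-(10): stop when two subsequent outcomes agree on the most probable
    # class and both probabilities exceed the threshold tv.
    prev_class = max(previous, key=previous.get)
    curr_class = max(current, key=current.get)
    return (prev_class == curr_class
            and previous[prev_class] > tv
            and current[curr_class] > tv)

A function of the form of should_stop can be plugged directly into the sobc sketch given after Algorithm 1, with the test selection built on measure_m.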

The introduction of the outcome confirmation was necessary in order to avoid stopping on a single outcome whose probability exceeds the threshold value but whose class label is wrong. We expect that the smaller the value of the threshold, the earlier the stopping criterion is activated and the greater the reduction of the number of performed tests. However, an excessive reduction of tests may lead to a significant deterioration of prediction accuracy. Therefore, the threshold value should be set experimentally (see Section 5 for details).

5. Evaluation of the proposed method

The experiment was aimed at examining the prediction accuracy and the reduction of test costs for different thresholds. Moreover, we have compared our approach with two popular feature selection schemes: filter and wrapper.

5.1. Experimental design

We used 8 of the 37 data sets considered by Zhang et al. in their studies on applying attribute value taxonomies (AVT) in the naïve Bayes classifier [6]. This selection resulted from limitations of our computational platform, which does not handle numerical attributes. We also employed the AVTs developed by Zhang and colleagues for these data sets. Three of the selected data sets (Mushroom, Soybean and Nursery) have attribute value taxonomies (AVTs) supplied by domain experts. For the remaining five data sets (Audiology, Breast Cancer, Car Evaluation, Dermatology, Zoo) we have considered AVTs generated by Zhang et al. using their AVT-Learner tool. AVT is the simplest form of AVO and is also considered in this paper. Before running the experiment, we preprocessed the data sets by removing decision classes with fewer than 10 examples (such heavily underrepresented classes may have affected the evaluated performance). Table 1 lists the data sets and their final characteristics – the data sets that were affected during preprocessing are marked with '∗' and their original characteristics are given in brackets. In this table we also indicate the class imbalance in the considered data sets by reporting the percentage of examples in the most frequent and least frequent classes. Finally, for each data set we list the number of non-trivial AVOs. Note that all data sets, except the Car Evaluation data set, are associated with a mixture of non-trivial and trivial AVOs.

The experimental design relied on the stratified 10-fold cross validation, which was repeated 10 times.

Table 1
Benchmark data sets used in the experiments.

Data set         Examples        Classes   Imbalance: max–min [%]    Attributes   AVOs
Audiology (∗)    169 (226)       5 (24)    33.7–11.8 (25.2–0.4)      69           8
Breast cancer    286             2         70.3–29.7                 9            6
Car evaluation   1728            4         70.0–3.8                  6            6
Dermatology      366             6         30.6–5.5                  34           33
Mushroom         8124            2         51.8–48.2                 22           17
Nursery (∗)      12958 (12960)   4 (5)     33.3–2.5 (33.3–0.0002)    8            6
Soybean (∗)      675 (683)       18 (19)   13.6–2.1 (13.5–1.2)       35           19
Zoo (∗)          84 (101)        4 (7)     48.8–11.9 (40.6–4.0)      16           1

Table 2
Maximum degree of test costs reduction without the deterioration of prediction accuracy.

Data set      tv∗    Gain [%]   sOBC^AVO (tv=tv∗) [%]   sOBC^AVO (tv=1) [%]
Audiology     0.99   76.2       89.9                    88.9
Breast        0.6    90.1       74.2                    72.5
Car           0.6    48.4       86.0                    86.1
Dermatology   0.99   75.9       96.4                    97.9
Mushroom      0.99   89.4       98.6                    99.7
Nursery       0.75   57.0       89.9                    90.1
Soybean       0.95   64.4       94.2                    94.3
Zoo           0.99   78.1       100.0                   100.0

Learning sets used to develop sOBC included examples represented precisely (i.e., described by specific values of all attributes). In contrast, testing sets initially included examples represented by missing values only (i.e., a root value for each AVO). Their representation precision was sequentially increased until the stopping criterion was activated or the classified examples were represented precisely for each attribute. We have chosen m-estimation of probabilities (4), because this estimation performed better than the Laplace estimation (3) in our experiments [3]. The parameters of m-estimation were the following: p_a = n_{C_j}/N, m = 0.0001, where n_{C_j} is the number of training examples with class label C_j and N is the number of training examples.

The experimental design involved two phases. The first phase was focused on examining the behaviour of our method applied to the data sets with AVOs (sOBC^AVO) for different threshold values, in order to find the best threshold values that reduce test costs while maintaining the classification accuracy. In our experimental study we assumed that all tests had the same cost, thus our method could be considered as a specific approach to feature selection in the setting with trivial AVO. Therefore, in the second phase we compared our approach (sOBC) with two popular feature selection schemes: filter and wrapper.

5.2. Experimental results

The experimental results of the first phase are presented in Fig. 3. The threshold values – the parameter of the stopping criterion – were set as follows: 1, 0.99, 0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60. For these thresholds we present the prediction accuracy and the reduction degree of test costs (Gain). These measures are defined as follows:

\mathrm{Accuracy} = \frac{n}{N} \cdot 100, \qquad \mathrm{Gain} = \left(1 - \frac{T_c}{T}\right) \cdot 100    (11)

where n is the number of correctly classified test examples, N is the number of test examples, T_c is the cost of tests conducted with the stopping criterion, and T is the cost of tests conducted without activating the stopping criterion, T_c ≤ T. These measures are presented on different scales (70–100% for Accuracy on the left axis, 0–100% for Gain on the right axis). We have decided to use different scales and axes in order to give a more detailed insight into the changes of the Accuracy measure. Let us indicate that for the threshold value equal to 1 the stopping criterion is not activated and the sequential classification is conducted as long as the representation of a classified example contains abstract values. Gain in such a case is equal to 0 (T_c = T). The prediction accuracy for this threshold is determined for precisely represented classified examples and is used as a reference point for the prediction accuracies obtained for smaller thresholds, when the stopping criterion breaks the sequential classification process.

The results presented in Fig. 3 confirm our hypothesis that it is possible to reduce the cost of tests while maintaining prediction accuracy. The general observation is the following: decreasing the threshold value increases Gain. At the same time the prediction accuracy is maintained for a certain threshold range, and then begins to decrease. Thus, we are able to indicate such a threshold value tv∗ for which we get the maximum Gain while maintaining prediction accuracy. These threshold values for each data set are given in Table 2.

Table 3
The comparison of sOBC with standard feature selection schemes.

Data set                              sOBC(tv=1)   sOBC(tv=tv∗)   sOBC(tv=0.95)   NB     NB_F   NB_W
Audiology (tv∗ = 0.99)    Acc. [%]    88.9         89.6           87.81           87.0   84.9   84.0
                          Gain [%]    0            77.1           81.60           0      88.4   88.7
Breast (tv∗ = 0.6)        Acc. [%]    72.5         73.2           72.5            71.1   73.0   72.3
                          Gain [%]    0            77.5           10.1            0      54.2   63.4
Car (tv∗ = 0.65)          Acc. [%]    86.1         86.1           86.1            85.5   85.5   85.3
                          Gain [%]    0            39.3           33.3            0      0      8.8
Dermatology (tv∗ = 0.99)  Acc. [%]    97.9         96.2           95.7            97.3   98.1   97.2
                          Gain [%]    0            65.1           68.3            0      44.4   62.6
Mushroom (tv∗ = 0.6)      Acc. [%]    99.7         99.4           99.8            95.8   98.5   99.6
                          Gain [%]    0            90.8           89.1            0      81.8   85.0
Nursery (tv∗ = 0.8)       Acc. [%]    90.1         90.1           90.1            90.2   90.1   90.3
                          Gain [%]    0            39.5           26.2            0      4.9    0
Soybean (tv∗ = 0.99)      Acc. [%]    94.3         94.5           93.0            92.2   92.0   92.8
                          Gain [%]    0            64.2           70.0            0      36.3   52.7
Zoo (tv∗ = 0.8)           Acc. [%]    100          100            100             97.6   99.6   96.8
                          Gain [%]    0            81.5           81.4            0      45.5   63.7


Fig. 3. Experimental evaluation of sOBC^AVO for different tv values – the parameter of the stopping criterion.

The prediction accuracies for these threshold values (sOBC^AVO at tv = tv∗) and for the threshold value equal to 1 (sOBC^AVO at tv = 1) are also given in Table 2. We can observe that it is possible to obtain a very significant reduction of test costs while maintaining prediction accuracy. For example, for the Breast data set the reduction is equal to 90.1%, which means that we need to conduct only 9.9% of the tests. For two data sets (Dermatology and Mushroom), even a very high threshold value equal to 0.99 causes a deterioration of prediction accuracy. However, for these data sets we can observe at the same time a significant Gain for this threshold value. Finally, let us notice that for some data sets (Audiology, Breast) we are even able to increase the prediction accuracy for some threshold values.


Let us also notice that it is possible to indicate a fixed threshold value (0.95), for which we may obtain quite a significant Gain with a rather small deterioration of the prediction accuracy for each data set.

In our experimental study we assumed that all tests had the same cost, thus our method could be considered as a specific approach to feature selection in the setting with trivial AVO. Therefore, in the second phase we compared our approach (sOBC) with two popular feature selection schemes: filter and wrapper. We have used the Weka software [35] to run these schemes. We have selected correlation-based selection (CfsSubsetEval) with greedy search (BestFirst) as the filter (NB_F) and internal cross validation (WrapperSubsetEval) with greedy search (BestFirst) as the wrapper (NB_W). Naive Bayes implemented in Weka was used as the classifier for the filtered training data and also inside the wrapper. The results of the second phase are given in Table 3. The results reveal that the wrapper works better than the filter in terms of Gain (the only exception was observed for the Nursery data set). Our method with the optimized threshold value (sOBC at tv = tv∗) offered improved Gain in comparison to the wrapper (the only exception was observed for the Audiology data set). Moreover, our method with a fixed threshold value (sOBC at tv = 0.95) also offered improved Gain in comparison to the wrapper (the only exceptions were observed for the Audiology and Breast data sets).

6. Conclusions

We have proposed a novel approach to the problem of classification with test costs, that is, the costs of obtaining attribute values of classified examples. We assume that attribute values are represented at different levels of abstraction and model domain background knowledge. The proposed method sequentially explores these levels of abstraction during classification. This allows us to reduce test costs while maintaining the prediction accuracy. In the experimental evaluation of our method on 8 UCI data sets we have been able to reduce test costs by at least 48% while maintaining the prediction accuracy. We admit that this reduction requires tuning the threshold value (considered in the stopping criterion) separately for each data set. However, we have observed that it has been possible to obtain acceptable results for the fixed threshold value of 0.95. This value can be used as a rule of thumb for new data sets. We have also considered our method in the setting without AVO and with uniform test costs. Then it boils down to feature selection similar to the incremental wrapper described in [36], although more local (limited to the context of a single classified example). An experimental comparison to the filter and wrapper schemes demonstrates that for the majority of the considered data sets our method has yielded better results (a smaller set of selected features while maintaining the prediction accuracy) than both competitive schemes. As part of future work we are going to explore our approach in classification with different test costs. Moreover, we are going to examine the effect of more complex AVOs (e.g., with a larger number of levels) on the test cost reduction.

Acknowledgments

The authors are grateful for the anonymous reviewers' insightful comments and valuable suggestions, which have substantially improved the quality of this paper.

References

[1] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, Recent advances and emerging challenges of feature selection in the context of big data, Knowl.-Based Syst. 86 (2015) 33–45.
[2] M. Saar-Tsechansky, F.J. Provost, Handling missing values when applying classification models, J. Mach. Learn. Res. 8 (2007) 1623–1657.
[3] T. Łukaszewski, J. Józefowska, A. Lawrynowicz, L. Józefowski, A. Lisiecki, Controlling the prediction accuracy by adjusting the abstraction levels, in: Proceedings of the 6th International Conference on Hybrid Artificial Intelligent Systems, 2011, pp. 288–295.
[4] P.D. Turney, Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm, J. Artif. Intell. Res. 2 (1995) 369–409.
[5] S. Zhang, Z. Qin, C.X. Ling, S. Sheng, Missing is useful: missing values in cost-sensitive decision trees, IEEE Trans. Knowl. Data Eng. 17 (12) (2005) 1689–1693.
[6] J. Zhang, D. Kang, A. Silvescu, V. Honavar, Learning accurate and concise naïve Bayes classifiers from attribute value taxonomies and data, Knowl. Inf. Syst. 9 (2) (2006) 157–179.
[7] M. Ye, X. Wu, X. Hu, D. Hu, Knowledge reduction for decision tables with attribute value taxonomies, Knowl.-Based Syst. 56 (2014) 68–78.
[8] K. Pancerz, A. Lewicki, Encoding symbolic features in simple decision systems over ontological graphs for PSO and neural network based classifiers, Neurocomputing 144 (2014) 338–345.
[9] K. Pancerz, A. Lewicki, R. Tadeusiewicz, Ant-based extraction of rules in simple decision systems over ontological graphs, Appl. Math. Comput. Sci. 25 (2) (2015) 377–387.
[10] T. Łukaszewski, S. Wilk, Sequential classification by exploring levels of abstraction, in: Proceedings of the 18th International Conference in Knowledge Based and Intelligent Information and Engineering Systems, 2014, pp. 309–317.
[11] Z. Qin, C. Zhang, T. Wang, S. Zhang, Cost sensitive classification in data mining, in: Proceedings of the 6th International Conference on Advanced Data Mining and Applications, 2010, pp. 1–11.
[12] S. Lomax, S. Vadera, A survey of cost-sensitive decision tree induction algorithms, ACM Comput. Surv. 45 (2) (2013) 16.
[13] S. Norton, Generating better decision trees, in: Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI 89, 1989, pp. 800–805.
[14] M. Núñez, The use of background knowledge in decision tree induction, Mach. Learn. 6 (3) (1991) 231–250.
[15] M. Tan, Cost-sensitive learning of classification knowledge and its applications in robotics, Mach. Learn. 13 (1) (1993) 7–33.
[16] J.V. Davis, J. Ha, C.J. Rossbach, H.E. Ramadan, E. Witchel, Cost-sensitive decision tree learning for forensic classification, in: Proceedings of the 17th European Conference on Machine Learning, 2006, pp. 622–629.
[17] R. Greiner, A.J. Grove, D. Roth, Learning cost-sensitive active classifiers, Artif. Intell. 139 (2) (2002) 137–174.
[18] V.B. Zubek, T.G. Dietterich, Pruning improves heuristic search for cost-sensitive learning, in: Proceedings of the Nineteenth International Conference on Machine Learning, ICML, 2002, pp. 19–26.
[19] C.X. Ling, Q. Yang, J. Wang, S. Zhang, Decision trees with minimal costs, in: Proceedings of the Twenty-first International Conference on Machine Learning, ICML, 2004, pp. 69–76.
[20] X. Chai, L. Deng, Q. Yang, C.X. Ling, Test-cost sensitive naive Bayes classification, in: Proceedings of the 4th IEEE International Conference on Data Mining, ICDM, 2004, pp. 51–58.
[21] S. Sheng, C.X. Ling, A. Ni, S. Zhang, Cost-sensitive test strategies, in: Proceedings of the Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, 2006, pp. 482–487.
[22] S. Lomax, S. Vadera, M. Saraee, A multi-armed bandit approach to cost-sensitive decision tree learning, in: Proceedings of the 12th IEEE International Conference on Data Mining Workshops, 2012, pp. 162–168.
[23] Z. Qin, S. Zhang, C. Zhang, Cost-sensitive decision trees with multiple cost scales, in: Proceedings of the 17th Australian Joint Conference on Artificial Intelligence, AI 2004, 2004, pp. 380–390.
[24] A. Ni, X. Zhu, C. Zhang, Any-cost discovery: learning optimal classification rules, in: Proceedings of the 18th Australian Joint Conference on Artificial Intelligence, 2005, pp. 123–132.
[25] S. Zhang, Cost-sensitive classification with respect to waiting cost, Knowl.-Based Syst. 23 (5) (2010) 369–378.
[26] S. Zhang, Decision tree classifiers sensitive to heterogeneous costs, J. Syst. Softw. 85 (4) (2012) 771–779.
[27] S. Lomax, S. Vadera, An empirical comparison of cost-sensitive decision tree induction algorithms, Expert Syst. 28 (3) (2011) 227–268.
[28] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, IEEE Intell. Syst. 13 (2) (1998) 44–49.
[29] C. Huang, C. Wang, A GA-based feature selection and parameters optimization for support vector machines, Expert Syst. Appl. 31 (2) (2006) 231–240.
[30] V. Bolón-Canedo, I. Porto-Díaz, N. Sánchez-Maroño, A. Alonso-Betanzos, A framework for cost-based feature selection, Pattern Recognit. 47 (7) (2014) 2481–2489.
[31] Z.E. Xu, M.J. Kusner, K.Q. Weinberger, M. Chen, O. Chapelle, Classifier cascades and trees for minimizing feature evaluation cost, J. Mach. Learn. Res. 15 (1) (2014) 2113–2144.
[32] T. Łukaszewski, J. Józefowska, A. Ławrynowicz, L. Józefowski, Handling the description noise using an attribute value ontology, Control Cybern. 40 (2011) 275–292.
[33] B. Cestnik, I. Bratko, On estimating probabilities in tree pruning, in: Proceedings of the European Working Session on Learning, EWSL, 1991, pp. 138–150.
[34] C. Lee, A gradient approach for value weighted classification learning in naive Bayes, Knowl.-Based Syst. 85 (2015) 71–79.
[35] R.R. Bouckaert, E. Frank, M.A. Hall, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, WEKA – experiences with a Java open-source project, J. Mach. Learn. Res. 11 (2010) 2533–2541.
[36] P. Bermejo, J.A. Gámez, J.M. Puerta, Speeding up incremental wrapper feature subset selection with naive Bayes classifier, Knowl.-Based Syst. 55 (2014) 140–147.