A cost sensitive decision tree algorithm with two adaptive mechanisms


Knowledge-Based Systems 88 (2015) 24–33


Xiangju Li, Hong Zhao *, William Zhu
Lab of Granular Computing, Minnan Normal University, Zhangzhou 363000, China


Article history: Received 30 January 2015; Received in revised form 16 August 2015; Accepted 18 August 2015; Available online 24 August 2015.
Keywords: Adaptive mechanisms; Cost sensitive; Decision tree; Granular computing

Abstract

Decision trees have been widely used in data mining and machine learning as a comprehensible knowledge representation. Minimal cost decision tree construction plays a crucial role in cost sensitive learning. Recently, many algorithms have been developed to tackle this problem. These algorithms choose an appropriate cut point of a numeric attribute by computing all possible cut points, and assign a node by testing all attributes. Therefore, the efficiency of these algorithms on large data sets is often unsatisfactory. To solve this issue, in this paper we propose a cost sensitive decision tree algorithm with two adaptive mechanisms to learn cost sensitive decision trees from training data sets, based on the C4.5 algorithm. The two adaptive mechanisms play an important role in cost sensitive decision tree construction. The first mechanism, the adaptive selecting cut point (ASCP) mechanism, selects the cut point adaptively to build a classifier rather than evaluating every possible cut point of an attribute. It significantly improves the efficiency of evaluating numeric attributes for cut point selection. The second mechanism, the adaptive removing attribute (ARA) mechanism, removes some redundant attributes in the process of selecting nodes. The effectiveness of the proposed algorithm is demonstrated on fourteen UCI data sets with test costs generated from a representative Normal distribution. Compared with the CS-C4.5 algorithm, the proposed algorithm significantly increases efficiency. © 2015 Elsevier B.V. All rights reserved.

1. Introduction

Data mining is one of the most actively researched areas in information science, with important real world applications [42]. Classification is one of the most important tasks in the data mining domain [19,23,24]. There are several techniques for classification, such as k-nearest neighbor algorithms [3], support vector machines [5,14,35], artificial neural networks [16,20], decision trees [1,8,46], rough set theory [25,32,41,45,50], and so on. A decision tree is a useful choice when the task is to classify or predict outcomes and to generate easy-to-interpret rules [6,15,39]. The structure of a decision tree is simple and easy to interpret. Typical decision tree induction algorithms, such as ID3 [33], CART [21] and C4.5 [34], have been successfully applied to a broad range of tasks, from learning to diagnose medical cases to learning to assess the credit risk of loan applicants. Existing techniques aim at training classifiers to minimize the expected number of errors [4,17,26,43]. This implicitly assumes that all classification errors involve the same cost [22,38]. In reality, different misclassification errors often lead to different costs. This leads to a new and active research topic, cost sensitive learning, which addresses

classification problems with cost. It aims to reduce the average total cost involved in the learning process. Cost sensitive learning [9,27,49] is an extension of traditional non-cost-sensitive data mining and machine learning [38,40]. Test cost and misclassification cost are the two most important cost types in real world applications [37]. The test cost is the money, time, or other resources spent to obtain the attribute values of an object [18]. The misclassification cost is the cost of assigning an object to class j when it actually belongs to class i [10,11,48]. Several studies focus on the test cost but fail to take the misclassification cost into account. However, in many applications it is important to consider both the test cost and the misclassification cost together [28]. Some algorithms, such as IDX [31], λ-ID3 [29], CSGain [7] and CS-C4.5 [13], have been developed to obtain appropriate cost sensitive decision trees. Existing cost sensitive learning techniques work well on small data sets. However, there are still many redundant computations in the process of decision tree construction. On the one hand, the existing techniques choose an appropriate cut point of a numeric attribute by computing all possible cut points. On the other hand, all attributes are tested in the process of assigning a node. Therefore, these algorithms have low efficiency on medium or large data sets. For example, the run time of the CS-C4.5 algorithm is nearly 10,000 ms on the Magic data set. This motivates us to propose a new approach to this issue.


In this paper, based on C4.5, we put forward a cost sensitive decision tree algorithm with two adaptive mechanisms. It is an effective method for cost sensitive decision tree construction, and we refer to it simply as the ACSDT algorithm. The major contributions of this method are twofold. On the one hand, an adaptive selecting cut point (ASCP) scheme is designed for cut point selection. It greatly reduces the number of candidate cut points compared with the traditional cut point selecting mechanism. On the other hand, we design an adaptive removing attribute (ARA) mechanism to remove some redundant attributes in the node assigning procedure. This mechanism adapts to the size of the data involved rather than being fixed, so it is realistic. These two mechanisms are the key of the ACSDT algorithm. In a word, the new algorithm greatly improves the efficiency of cost sensitive decision tree construction. The proposed algorithm is implemented in Java in our open source software COSER (Cost sensitive rough sets) [30]. A representative distribution, namely the Normal distribution, is employed to generate test costs from a statistical viewpoint. We undertake experiments on fourteen data sets from the UCI (University of California-Irvine) library [2]. Experimental results demonstrate the effectiveness of the ACSDT algorithm. The average total costs obtained by our algorithm are smaller than those of the CS-C4.5 algorithm [13] on twelve of the data sets selected in this experiment. In addition, the ACSDT algorithm is more efficient than the existing CS-C4.5 algorithm. On larger data sets, the improvement in efficiency tends to be quite significant; for example, the ACSDT algorithm is more than 115,777 ms faster on the Clean data set.

The rest of the paper is organized as follows. Section 2 reviews the basic knowledge involved in this article, including the decision system with test costs and misclassification costs, and the calculation of the average total cost of decision trees. In Section 3, we introduce the existing CS-C4.5 algorithm and give an analysis of its cut point selecting mechanism. Section 4 introduces the cost sensitive decision tree algorithm with two adaptive mechanisms. Section 5 adopts two examples to illustrate the cost sensitive decision tree construction and its average cost calculation procedure. Section 6 presents the experiment schemes and provides a brief analysis of the results. Finally, Section 7 presents the conclusions and future work.

Table 1
A numeric decision system (Liver).

Patient   Mcv    Alkphos   Sgpt   Sgot   Gammagt   Drinks   Selector
x1        88.0   66.0      20.0   21.0   10.0      0.5      1
x2        92.0   70.0      24.0   13.0   26.0      0.5      1
x3        88.0   61.0      19.0   21.0   13.0      0.5      2
x4        90.0   63.0      16.0   21.0   14.0      1.0      2
x5        86.0   84.0      18.0   14.0   16.0      0.5      2
...       ...    ...       ...    ...    ...       ...      ...
x344      91.0   68.0      27.0   26.0   14.0      16.0     1
x345      98.0   99.0      57.0   45.0   65.0      20.0     1

2. Preliminaries

In this section, we review the basic knowledge, including the decision system with test costs and misclassification costs, and the calculation of the average total cost of a decision tree.

2.1. The decision system with test costs and misclassification costs

In data mining and machine learning, the decision system with test costs and misclassification costs is an important concept, defined as follows.

Definition 1 [47]. A decision system with test costs and misclassification costs (DS-TM) is the 7-tuple:

$S = (U, C, D, V = \{V_a \mid a \in C \cup D\}, I = \{I_a \mid a \in C \cup D\}, tc, mc),$   (1)

where U is a nonempty finite set of objects called the universe, C is a nonempty finite set of condition attributes, D is a nonempty finite set of decision attributes, $\{V_a\}$ is a set of values for each attribute $a \in C \cup D$, $I_a$ is an information function for each attribute $a \in C \cup D$ (i.e., $I_a: U \to V_a$), tc is a test cost function (i.e., $tc: C \to \mathbb{R}^+ \cup \{0\}$), and mc is a misclassification cost matrix (i.e., $mc: k \times k \to \mathbb{R}^+ \cup \{0\}$).

Table 1 presents a decision system of the Bupa liver disorder data (Liver for short), where $U = \{x_1, x_2, x_3, \ldots, x_{345}\}$, C = {Mcv, Alkphos, Sgpt, Sgot, Gammagt, Drinks}, and D = {Selector}.

We adopt the test cost independent model [36] to define the cost of a test set. That is, $tc(B) = \sum_{a \in B} tc(a)$ for any $B \subseteq C$. The test cost function can be stored in a vector. An example of a test cost vector is listed in Table 2; that is, the test costs of Mcv, Alkphos, Sgpt, Sgot, Gammagt, and Drinks are $5, $5, $4, $8, $3, and $5, respectively.

We represent the binary class misclassification cost function by Table 3, where $mc_{(1,0)}$ stands for the cost of assigning a minority class object to the majority class and $mc_{(0,1)}$ represents the opposite misclassification scenario. Usually, $mc_{(0,0)} = mc_{(1,1)} = 0$. A decision system with both test costs and misclassification costs is illustrated by the following example.

Example 1. Consider the Liver decision system listed in Table 1. Table 2 is the test cost vector of the Liver decision system. The misclassification cost matrix is $mc = \begin{pmatrix} 0 & 50 \\ 100 & 0 \end{pmatrix}$. The test cost is $5 + $4 + $3 = $12 when the conditional attributes Mcv, Sgpt, and Gammagt are selected. In the Liver data set, the Selector field is used to split the data into two sets. Let $S_1 = \{x_i \mid Selector(x_i) = 1, x_i \in U\}$ and $S_2 = \{x_i \mid Selector(x_i) = 2, x_i \in U\}$. The number of objects in $S_1$ is $|S_1| = 145$, and the number of objects in $S_2$ is $|S_2| = 200$. That is, if a patient $x_i \in S_1$ ($x_i \in S_2$) is misclassified into $S_2$ ($S_1$), a penalty of $100 ($50) is paid.

2.2. The average total cost of a cost sensitive decision tree

Let T be a decision tree, U be the testing data set, and $x \in U$. Object x follows a path from the root of T to a leaf. Let the set of attributes on this path be $S_x$. The test cost of x is

$tc(x) = tc(S_x) = \sum_{a \in S_x} tc(a).$   (2)

We denote the real class label of x by $R_x$ and the prediction for x by $P_x$, as given by the classifier T. The misclassification cost of x is $mc(x) = mc(R_x, P_x)$. The total cost of x is $tc(x) + mc(x)$.

Table 2
An example of a test cost vector.

a        Mcv   Alkphos   Sgpt   Sgot   Gammagt   Drinks
tc(a)    $5    $5        $4     $8     $3        $5

Table 3
Binary class misclassification cost matrix.

Actual            Predicted
                  Majority class    Minority class
Majority class    mc(0,0)           mc(0,1)
Minority class    mc(1,0)           mc(1,1)


Definition 2 [29]. The average total cost (ATOC) of decision tree T on U is:

$ATOC(U) = \frac{\sum_{x \in U} (tc(x) + mc(x))}{|U|}.$   (3)

A numeric attribute can be used to split more than once during classification. Once an attribute has been measured, its value is known and there is no need to test it again. Hence, when attribute a is used again, we let tc(a) = 0 in Eq. (3) to avoid counting the test cost repeatedly.
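As a concrete illustration, the following Python sketch computes the average total cost of Eq. (3) for a set of test objects. It assumes a hypothetical classify function that returns, for each object, the attributes tested along its root-to-leaf path together with the predicted class; all names are illustrative and not part of the paper's implementation.

```python
def average_total_cost(objects, classify, test_cost, mis_cost):
    """Average total cost (Eq. 3): (sum of test + misclassification costs) / |U|.

    objects   : list of (features, real_label) pairs
    classify  : function(features) -> (tested_attributes, predicted_label)
    test_cost : dict mapping attribute name -> test cost
    mis_cost  : dict mapping (real_label, predicted_label) -> misclassification cost
    """
    total = 0.0
    for features, real_label in objects:
        tested, predicted = classify(features)
        # Each distinct attribute on the path is charged only once (tc(a) = 0 on reuse).
        total += sum(test_cost[a] for a in set(tested))
        total += mis_cost[(real_label, predicted)]
    return total / len(objects)


if __name__ == "__main__":
    # Toy usage with the cost settings of Example 1 (Liver).
    tc = {"Mcv": 5, "Alkphos": 5, "Sgpt": 4, "Sgot": 8, "Gammagt": 3, "Drinks": 5}
    mc = {(1, 1): 0, (2, 2): 0, (1, 2): 100, (2, 1): 50}

    def classify(features):
        # Stand-in for a learned tree: always tests Gammagt and Sgpt, predicts class 1.
        return ["Gammagt", "Sgpt"], 1

    data = [({"Gammagt": 10.0, "Sgpt": 20.0}, 1), ({"Gammagt": 65.0, "Sgpt": 57.0}, 2)]
    print(average_total_cost(data, classify, tc, mc))  # (7 + 0 + 7 + 50) / 2 = 32.0
```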

3. CS-C4.5 algorithm for cost sensitive decision tree Cost sensitive learning is an extension of traditional inductive learning for minimizing classification total cost. Learning from data with both test cost and misclassification cost is especially interesting. Recently, the CS-C4.5 algorithm for cost sensitive decision trees, which works well in real applications, has been published. In this section, we introduce the CS-C4.5 algorithm [13] and give a simple analysis with an example. 3.1. The heuristic function Attribute selection is a fundamental process in decision tree induction. In each induction step (i.e., generating a node) of building the decision tree, it is necessary to choose one attribute to split the remaining data. A lot of strategies, such as the information gain ratio function [34] and GINI index criterion [21], have been proposed for choosing the splitting attribute at each node. CS-C4.5 algorithm considers both information gain and test cost. The heuristic function in the CS-C4.5 algorithm is

$f(a) = Gain(a, p_a) / (tc(a) \times \varphi_a)^{\omega},$   (4)

where a is a conditional attribute, tc(a) is the test cost of attribute a, $p_a$ is the best threshold value of attribute a, and $Gain(a, p_a)$ is the information gain of attribute a. $\omega$ is a user-provided parameter to trade off the cost and the information gain; it is a positive number in [0, 1]. $\varphi_a$ is a risk factor used to penalize a particular type of test, known as a delayed test: a test, such as a blood test, for which there is a time lag between requesting and receiving the information [44].

3.2. The cut point selecting mechanism

A numeric attribute is typically discretized during decision tree generation by partitioning its range into two intervals. Let $P_a$ be the set of values of numeric attribute a, and $Va_x$ be the value of numeric attribute a for an object x. $p_a \in P_a$ is a threshold value of attribute a. For an object x, if $Va_x \le p_a$, we assign it to the left branch; otherwise we assign it to the right branch. Let $lb(a, p_a) = \{x_i \mid Va_{x_i} \le p_a, x_i \in U\}$ and $rb(a, p_a) = \{x_i \mid Va_{x_i} > p_a, x_i \in U\}$. We call such a threshold value $p_a$ a cut point [12]. The information gain of the partition induced by $p_a$ is denoted by $Gain(a, p_a)$. The cut point $p_a$ for which $Gain(a, p_a)$ is maximal amongst all the candidate cut points is taken as the best cut point, and $Gain(a) = Gain(a, p_a)$. The following example helps explain this mechanism.

Example 2. To illustrate the cut point selecting mechanism, we randomly select twenty objects $U'$ from U, which are listed in Table 4. In this table, $|S_1| = 11$ and $|S_2| = 9$.

Table 4
An example of a numeric decision system (Liver).

Patient   Mcv    Alkphos   Sgpt   Sgot   Gammagt   Drinks   Selector
x1        88.0   66.0      20.0   21.0   10.0      0.5      1
x2        92.0   70.0      24.0   13.0   26.0      0.5      1
x3        88.0   61.0      19.0   21.0   13.0      0.5      2
x4        90.0   63.0      16.0   21.0   14.0      1.0      2
x5        86.0   84.0      18.0   14.0   16.0      0.5      2
x6        87.0   52.0      21.0   19.0   30.0      0.5      2
x7        86.0   109.0     16.0   22.0   28.0      6.0      2
x8        89.0   77.0      26.0   20.0   19.0      1.0      1
x9        91.0   102.0     17.0   13.0   19.0      0.5      1
x10       85.0   54.0      47.0   33.0   22.0      0.5      2
x11       87.0   71.0      32.0   19.0   27.0      1.0      1
x12       93.0   99.0      36.0   34.0   48.0      6.0      2
x13       85.0   79.0      17.0   8.0    9.0       0.5      1
x14       95.0   78.0      27.0   25.0   30.0      2.0      2
x15       89.0   63.0      22.0   27.0   10.0      4.0      1
x16       91.0   57.0      31.0   23.0   42.0      0.5      1
x17       84.0   92.0      68.0   37.0   44.0      0.5      2
x18       82.0   62.0      17.0   17.0   15.0      0.5      1
x19       95.0   36.0      38.0   19.0   15.0      6.0      1
x20       91.0   80.0      37.0   23.0   27.0      4.0      1

Let $S_{l1}(a, p_a) = \{x_i \mid Selector(x_i) = 1, x_i \in lb(a, p_a)\}$, $S_{l2}(a, p_a) = \{x_i \mid Selector(x_i) = 2, x_i \in lb(a, p_a)\}$, $S_{r1}(a, p_a) = \{x_i \mid Selector(x_i) = 1, x_i \in rb(a, p_a)\}$, and $S_{r2}(a, p_a) = \{x_i \mid Selector(x_i) = 2, x_i \in rb(a, p_a)\}$. Let E be the information entropy of this table, $E_{lb}$ be the information entropy of the left branch, and $E_{rb}$ be the information entropy of the right branch. We have $E = -\sum_{i=1}^{2} (|S_i|/|U'|) \log_2 (|S_i|/|U'|) = -(11/20) \log_2 (11/20) - (9/20) \log_2 (9/20) = 0.9709$.

For simplicity, we show the cut point selecting mechanism for attribute Mcv. From Table 4 we have $P_{Mcv} = \{82, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93\}$, $lb(Mcv, 82) = \{x_{18}\}$, and $rb(Mcv, 82) = U' - lb(Mcv, 82)$. The number of objects in $lb(Mcv, 82)$ is $|lb(Mcv, 82)| = 1$ and the number of objects in $rb(Mcv, 82)$ is $|rb(Mcv, 82)| = 19$. $E_{lb} = -\sum_{i=1}^{2} (|S_{li}(Mcv, 82)|/|lb(Mcv, 82)|) \log_2 (|S_{li}(Mcv, 82)|/|lb(Mcv, 82)|) = 0$ and $E_{rb} = -(10/19) \log_2 (10/19) - (9/19) \log_2 (9/19) = 0.9980$. $Gain(Mcv, 82) = E - (|lb(Mcv, 82)|/|U'| \times E_{lb} + |rb(Mcv, 82)|/|U'| \times E_{rb}) = 0.0447$.

Similarly, we have $Gain(Mcv, 84) = 8.060 \times 10^{-4}$, $Gain(Mcv, 85) = 0.0018$, $Gain(Mcv, 86) = 0.0591$, $Gain(Mcv, 87) = 0.0600$, $Gain(Mcv, 88) = 0.0667$, $Gain(Mcv, 89) = 0.0110$, $Gain(Mcv, 90) = 0.0435$, $Gain(Mcv, 91) = 0.0208$, $Gain(Mcv, 92) = 0.0242$, and $Gain(Mcv, 93) = 8.060 \times 10^{-4}$. Therefore, $Gain(Mcv) = \max_{p_{Mcv} \in P_{Mcv}} \{Gain(Mcv, p_{Mcv})\} = Gain(Mcv, 88) = 0.0667$, and the selected cut point of attribute Mcv is 88.

On the EEG Eye State (Eeg) data set, the cut points and the information gain of each attribute obtained by the CS-C4.5 algorithm are recorded. For simplicity, we show the candidate cut points of eight attributes of the Eeg data set in Fig. 1. The horizontal coordinate represents the candidate cut points of an attribute and the vertical coordinate represents the information gain. From Fig. 1 we can observe that each attribute is evaluated nearly 400 times when selecting the best cut point on the Eeg data set. From the above example and experimental analysis, we can see that this scheme of selecting a cut point is unsatisfactory, since there are usually many attributes and many objects in each data set. The huge number of candidate cut points makes the computation space too large to search. Therefore, we study an algorithm with a new scheme for cut point selection in Section 4.
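For reference, the following Python sketch shows how the information gain of a single candidate cut point can be computed for a binary class problem, in the spirit of Example 2. The function names are illustrative, and the sketch is a generic implementation of the traditional mechanism rather than the paper's code, so its output is not claimed to reproduce the printed values exactly.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * log2(p)
    return ent

def gain_at_cut_point(values, labels, cut):
    """Information gain of splitting at 'cut': objects with value <= cut go left."""
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Usage: the traditional mechanism evaluates every distinct value as a candidate cut point.
mcv = [88, 92, 88, 90, 86, 87, 86, 89, 91, 85, 87, 93, 85, 95, 89, 91, 84, 82, 95, 91]
sel = [1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1]
best = max(sorted(set(mcv)), key=lambda c: gain_at_cut_point(mcv, sel, c))
print(best, gain_at_cut_point(mcv, sel, best))
```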

Fig. 1. The information gain with different cut points on Eeg data set. (Panels (a) and (b) plot, for attributes AF3, F7, O1, O2, P7, T8, FC5 and AF4, the candidate cut points on the horizontal axis against the information gain on the vertical axis.)

4. Cost sensitive decision tree algorithm with two adaptive mechanisms

In this section, we introduce a heuristic function for attribute selection and design an adaptive selecting cut point (ASCP) scheme for cut point selection. We then propose a cost sensitive decision tree algorithm with two adaptive mechanisms, designed for minimal cost decision tree construction. This algorithm is called the ACSDT algorithm.

4.1. The heuristic function

Attribute selection is a fundamental process in decision tree induction. The C4.5 algorithm adopts the information gain ratio as its heuristic function. The information gain ratio of attribute a is

$GainRatio(a, p_a) = Gain(a, p_a) / Split\_infor(a, p_a),$   (5)

where a, $p_a$ and $Gain(a, p_a)$ have the same meanings as in the CS-C4.5 algorithm and $Split\_infor(a, p_a)$ is the split information entropy of attribute a. Obviously, the larger the gain ratio of an attribute, the more information it contains. The proposed cost sensitive decision tree algorithm employs the following heuristic function based on C4.5:

$Quality(a, p_a) = GainRatio(a, p_a) \times (1 + tc(a))^{\lambda},$   (6)

where tc(a) is the test cost of attribute a and $\lambda$ is a non-positive number. It is introduced to adjust the influence of the test cost; that is, an attribute with a lower test cost and more information has an advantage in the selection. Let $P_a$ be the set of all candidate cut points of numeric attribute a. We rewrite Eq. (6) as

$Quality(a) = \max_{p_a \in P_a} \{Quality(a, p_a)\}.$   (7)

Compared with the heuristic function of the CS-C4.5 algorithm, our heuristic function adopts the information gain ratio rather than the information gain to express the attribute's classification ability. This reduces the bias towards attributes that have more values [34]. In addition, our heuristic function degrades to the heuristic function of the C4.5 algorithm when attribute a is used again.

4.2. The adaptive scheme for cut point selections

In this section, an adaptive selecting cut point (ASCP) scheme is introduced in Algorithm 1, which gives a detailed description of the ASCP scheme for cut point selection. It contains two main steps.

Algorithm 1. Adaptive selecting cut point scheme
Input: p_{a_i}, step
Output: Quality*
Method: ASCP
1:  Quality* = Quality(a_i, p_{a_i});
2:  if (Quality(a_i, p_{a_i} + step) > Quality*) then
3:    if (Quality(a_i, p_{a_i} + step) ≥ Quality(a_i, p_{a_i} − step)) then
4:      Quality* = Quality(a_i, p_{a_i} + step);
5:      p_{a_i} = p_{a_i} + step;
6:      step = step / 2;
7:      ASCP(p_{a_i}, step);
8:    end if
9:  end if
10: if (Quality(a_i, p_{a_i} − step) > Quality*) then
11:   if (Quality(a_i, p_{a_i} − step) ≥ Quality(a_i, p_{a_i} + step)) then
12:     Quality* = Quality(a_i, p_{a_i} − step);
13:     p_{a_i} = p_{a_i} − step;
14:     step = step / 2;
15:     ASCP(p_{a_i}, step);
16:   end if
17: end if
18: return Quality*

In Step 1, we compute the values of Quality(a, p_a), Quality(a, p_a + step) and Quality(a, p_a − step). Meanwhile, the maximum among these three values is obtained.

In Step 2, we select Quality(a, p_a) as the return value if it is the maximum among the three values. Otherwise, we perform this algorithm again if Quality(a, p_a + step) or Quality(a, p_a − step) is the maximum. Lines 4 to 7 and Lines 12 to 15 illustrate this process clearly.
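To make the recursion concrete, here is a small Python sketch of the ASCP search. It assumes a stand-in quality function that evaluates Eq. (6) for the attribute at a given cut point, and it is a simplified single-path variant of Algorithm 1 (it only moves towards the better neighbour at each level), not the paper's Java implementation.

```python
def ascp(quality, cut, step, tol=1e-6):
    """Adaptive selecting cut point (ASCP) search, after Algorithm 1 (simplified).

    quality : function(cut_point) -> Quality(a, p_a) of Eq. (6)
    cut     : current cut point, initially (maxValue + minValue) / 2
    step    : current step, initially (maxValue - minValue) / 4
    Returns (best_quality, best_cut).
    """
    best = quality(cut)
    if step < tol:
        return best, cut
    up, down = quality(cut + step), quality(cut - step)
    # Move towards the more promising neighbour and halve the step
    # (cf. lines 4-7 and 12-15 of Algorithm 1); stop when neither improves.
    if up > best and up >= down:
        return ascp(quality, cut + step, step / 2, tol)
    if down > best and down > up:
        return ascp(quality, cut - step, step / 2, tol)
    return best, cut


if __name__ == "__main__":
    # Toy usage: a smooth quality curve peaking near a cut point of 88,
    # searched over the attribute range [82, 95] as in Example 3.
    q = lambda c: 1.0 - (c - 88.0) ** 2 / 100.0
    print(ascp(q, (95 + 82) / 2.0, (95 - 82) / 4.0))
```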

Example 3. Consider the numeric decision system in Table 4 and let $\lambda = -1$. We denote by $maxV_a$ the maximum value of attribute a and by $minV_a$ the minimum value of a, that is,


$maxV_a = \max_{x_i \in U} \{Va_{x_i}\}$ and $minV_a = \min_{x_i \in U} \{Va_{x_i}\}$. For simplicity, this example illustrates the calculation of the heuristic function value of attribute Mcv.

Step 1. Sort the objects by the value of attribute Mcv and find its maximum value ($maxV_{Mcv}$) and minimum value ($minV_{Mcv}$). From Table 4 we have $maxV_{Mcv} = 95$ and $minV_{Mcv} = 82$, so $p_{Mcv} = (95 + 82)/2 = 88.5$ and $step = (95 - 82)/4 = 3.25$.

Step 2. Compute the heuristic function value of Mcv when the cut point is $p_{Mcv}$ (that is, 88.5). The information gain of attribute Mcv is $Gain(Mcv, p_{Mcv}) = Gain(Mcv, 88.5) = 0.0667$. The information gain ratio of attribute Mcv is $GainRatio(Mcv, 88.5) = Gain(Mcv, 88.5)/Split\_infor(Mcv, 88.5) = 0.0667$. The value of the heuristic function is $Quality(Mcv, 88.5) = GainRatio(Mcv, 88.5) \times (1 + tc(Mcv))^{\lambda} = 0.0667 \times (1 + 5)^{-1} = 0.0111$.

Step 3. Compute the heuristic function value of Mcv when the cut point is $p_{Mcv} + step$ (that is, 91.75). $Gain(Mcv, 91.75) = 0.0208$, $GainRatio(Mcv, 91.75) = Gain(Mcv, 91.75)/Split\_infor(Mcv, 91.75) = 0.0263$, and $Quality(Mcv, 91.75) = 0.0044$.

Step 4. Compute the heuristic function value of Mcv when the cut point is $p_{Mcv} - step$ (that is, 85.25). $Gain(Mcv, 85.25) = 0.0018$, $GainRatio(Mcv, 85.25) = Gain(Mcv, 85.25)/Split\_infor(Mcv, 85.25) = 0.0025$, and $Quality(Mcv, 85.25) = 0.0004$.

Step 5. According to Eq. (7), we obtain $Quality(Mcv) = 0.0111$.

With the ASCP scheme, attribute Mcv is evaluated only three times in Example 3, whereas it is evaluated eleven times in Example 2 by the traditional cut point selecting mechanism. The traditional mechanism must evaluate N − 1 cut points for each attribute if the N examples have distinct values. As networks connecting computational resources get faster, unprecedented amounts of data occur in our lives, so N is typically very large. The ASCP scheme adopts an adaptive method for searching for the best cut point without being influenced by N. Although the ASCP scheme looks simple, it performs well in selecting the cut point of an attribute. A simple experiment illustrates this clearly. We compare the traditional cut point selecting scheme with the ASCP scheme on a number of data sets, recording the candidate cut points and the information gain of each attribute. The candidate cut points and information gain of twelve attributes on the Wdbc data set are shown in Fig. 2. The horizontal coordinate represents the candidate cut points of an attribute and the vertical coordinate represents the gain ratio. From Fig. 2, we have the following observations. (1) The ASCP scheme is very effective. As can be seen from Fig. 2, the ASCP scheme selects the best cut point in most cases. (2) From Fig. 2(b), we find that only for attributes a07 and a28 does the ASCP scheme fail to obtain the best cut point; for the other ten attributes, the ASCP scheme obtains the best cut point. Hence, the performance of the ASCP scheme is acceptable.

(3) In addition, the amount of calculation of the ASCP scheme is much smaller than that of the traditional scheme. It is easy to see from Fig. 2 that fewer than 10 candidate cut points are evaluated by the new scheme for each attribute.

4.3. Cost sensitive decision tree algorithm with two adaptive mechanisms

In this section, we provide a detailed description of the ACSDT algorithm, which is listed in Algorithm 2. It contains six main steps, which we detail below.

Algorithm 2. Cost sensitive decision tree algorithm
Input: the training data set S; the set of attributes C; parameter ∂
Output: tree, a decision tree
Method: ACSDT
1:  Create a node tree;
2:  if (S is pure or C is empty) then
3:    return tree as a leaf node;
4:  end if
5:  maxQuality = 0; // The max value of the heuristic function
6:  // Select the attribute with the highest value of the heuristic function
7:  for (i = 0; i < |C|; i++) do
8:    Compute the max value (denoted as maxValue) and the minimal value (denoted as minValue) of attribute a_i;
9:    cp = (maxValue + minValue) / 2;
10:   step = (maxValue − minValue) / 4;
11:   Quality(a_i) = ASCP(cp, step);
12:   if (Quality(a_i) > maxQuality) then
13:     A = a_i;
14:     maxQuality = Quality(A);
15:   else // Remove attribute
16:     if (|C| > ∂ and Quality(a_i) < (1/∂) × maxQuality) then
17:       C = C − {a_i};
18:     end if
19:   end if
20: end for
21: if (maxQuality = 0) then
22:   return tree;
23: end if
24: tree = tree ∪ A; tc(A) = 0;
25: // Split S into two data sets S_1 and S_2: put the objects with VA_{x_i} ≤ cp (VA_{x_i} > cp) into S_1 (S_2);
26: for (i = 1; i ≤ 2; i++) do
27:   ACSDT(S_i, C);
28: end for

Step 1 corresponds to Line 1. The tree starts as a single node representing the training objects.

Step 2 contains Lines 2–4. If the objects all belong to the same class, or there are no remaining attributes on which the objects may be further partitioned, then the node becomes a leaf.

Step 3 corresponds to Lines 7–19, which contain the key code of Algorithm 2. In this step, we select the best attribute among the remaining attributes as a node. First, the max value and the minimal value of attribute a are obtained in Line 8. Then we compute the middle value of the attribute's range and set a step in order to obtain the value of the heuristic function Quality(a). Meanwhile, we get the best cut point cp by the ASCP scheme described in Section 4.2. Finally, Lines 11–13 describe the comparison process.

Fig. 2. The information gain with different cut points on Wdbc data set. (Panels (a) and (b) plot, for attributes a02, a03, a04, a07, a13, a16, a18, a21, a22, a26, a28 and a30, the candidate cut points on the horizontal axis against the information gain on the vertical axis, comparing the traditional mechanism with the dynamic (ASCP) mechanism.)

We select the best attribute A from these attributes according to the values of Quality(a) obtained from the ASCP scheme.

Step 4 contains Lines 15–17, which describe the adaptive removing attribute (ARA) mechanism. ∂ in Line 16 is a positive parameter that adjusts how many attributes are removed during decision tree construction; the bigger the parameter, the more attributes are removed in the process of node selection. The ARA mechanism is illustrated in Example 4.

Step 5 corresponds to Line 25. A branch is created for $VA_{x_i} \le cp$ and for $VA_{x_i} > cp$, respectively. If an object's value of attribute A is less than or equal to cp, it is put into the left branch; otherwise, it is put into the right branch. The data set is thus divided into two data sets.

Step 6 contains Lines 26–28. The algorithm recursively applies the same process to generate a decision tree for the data sets $S_1$ and $S_2$.

An important problem is how to label a node N when it contains examples of different classes. In our method, with MA denoting the majority class and MI denoting the minority class, the label of a node is

$label(N) = \begin{cases} MA, & \text{if } mc_{(1,0)} \times mi \le mc_{(0,1)} \times ma, \\ MI, & \text{if } mc_{(1,0)} \times mi > mc_{(0,1)} \times ma, \end{cases}$

where mi is the number of MI-class objects and ma is the number of MA-class objects in the node.

Example 4. Consider the German data set, which contains 24 condition attributes and 1000 objects, and let ∂ = 10. The values of the heuristic function for different attributes of the German data set when deciding the first node are listed in Table 5. Let bestNode denote the current best attribute when selecting the first node. The process of deciding the first node is as follows. First, initialize bestNode (bestNode = Quality($a_1$) = $9.5 \times 10^{-3}$). Second, compare the value of the heuristic function of the next attribute (Quality($a_2$)) with bestNode. If Quality($a_2$) < (1/10) × bestNode, we delete $a_2$; that is, $a_2$ is not considered in the process of selecting the next node. If Quality($a_2$) > bestNode, we update bestNode.

Table 5
The value of Quality of each attribute on the German data set.

attribute (a)            a1     a2     a3      a4      a5     a6     a7     ...    a24
Quality(a) (×10⁻³)       9.5    4.4    26.0    10.4    7.0    2.7    0.3    ...    1.1

We neither delete $a_2$ nor update bestNode, since (1/10) × bestNode < Quality($a_2$) < bestNode. Quality($a_3$) > bestNode, so we update bestNode (bestNode = Quality($a_3$)). Attributes $a_4$, $a_5$ and $a_6$ are in the same situation as $a_2$. Attribute $a_7$ is deleted, since Quality($a_7$) < (1/10) × bestNode. Similarly, we examine each attribute in turn and judge whether it should be deleted. Finally, the attribute with the highest heuristic function value is taken as the first node.
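The following Python sketch illustrates the ARA-style scan described in Example 4, using the Quality values of Table 5. The function and variable names are illustrative, and the snippet is a simplified sketch of the mechanism rather than the paper's implementation.

```python
def select_node_with_ara(qualities, n_candidates, delta=10):
    """Scan attributes, tracking the best one and removing weak ones (ARA).

    qualities    : dict attribute -> Quality(a) value
    n_candidates : number of candidate attributes still in C
    delta        : the positive parameter controlling removal (the larger the
                   value, the more attributes are removed); attributes with
                   quality below bestQuality/delta are discarded, provided
                   more than delta candidates remain.
    Returns (best_attribute, removed_attributes).
    """
    best_attr, best_quality = None, 0.0
    removed = []
    remaining = n_candidates
    for attr, q in qualities.items():
        if q > best_quality:
            best_attr, best_quality = attr, q
        elif remaining > delta and q < best_quality / delta:
            removed.append(attr)      # no longer considered for later nodes
            remaining -= 1
    return best_attr, removed


if __name__ == "__main__":
    # First seven Quality values (x 10^-3) of the German data set from Table 5;
    # the data set has 24 condition attributes in total.
    quality = {"a1": 9.5, "a2": 4.4, "a3": 26.0, "a4": 10.4,
               "a5": 7.0, "a6": 2.7, "a7": 0.3}
    print(select_node_with_ara(quality, n_candidates=24))
    # -> ('a3', ['a7']): a3 is the current best node attribute and a7 is removed.
```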

5. ACSDT algorithm for cost sensitive decision tree

Decision trees are built using a set of data referred to as the training data set. A different set, called the test data set, is used to check the model. When we obtain a new object from the test data set, we can make a prediction about the state of its class variable by following the path in the tree from the root to a leaf node. The following example helps explain this process.

Example 5. Consider the Liver decision system illustrated in Example 1. The Liver data set is randomly split into two subsets: 60% for training ($U_{tr}$) and 40% for testing ($U_{te}$). For simplicity, we set $\lambda = -1$ and ∂ = 10. Fig. 3 illustrates a cost sensitive decision tree (T) obtained by the ACSDT algorithm on the training data set of the Liver data set. The circles denote non-leaf nodes of T and the leaf nodes of T are denoted by rectangles. LN = {A, B, ..., O} is the set of leaf nodes of T. As can be seen from Fig. 3, every leaf node carries three kinds of information, "label($num_1$, $num_2$)", where "label" is the predicted class of the objects in this leaf node, "$num_1$" is the number of objects in this leaf node, and "$num_2$" is the number of objects whose real class is not "label". For instance, the information in J is "1(40, 8)", where "1" indicates the class of the objects in leaf node J, "40" is the number of objects in J, and "8" is the number of objects whose real class is not "1" but which have been misclassified as class "1".


Fig. 3. The cost sensitive decision tree on the training data set of the Liver data set.

We adopt the average total cost to test the performance of a cost sensitive tree. An example is given to illustrate the calculation of the average cost of the cost sensitive decision tree.

Example 6. A cost sensitive decision tree is obtained from the decision system in Example 5. Let $X_M$ be the set of objects in leaf node M, and $At_M$ be the set of attributes on the path to leaf node M. To illustrate the calculation process, we use the twenty randomly selected objects listed in Table 4.

Step 1. According to the data set ($U'$) in Table 4 and the cost sensitive decision tree (T), we obtain $X_i$ ($i \in \{A, B, \ldots, O\}$) as follows: $X_C = \{x_5, x_9, x_{13}, x_{18}\}$, $X_D = \{x_1, x_3, x_4\}$, $X_H = \{x_{15}\}$, $X_J = \{x_8, x_{19}\}$, $X_K = \{x_2, x_6, x_7, x_{14}\}$, $X_L = \{x_{10}, x_{11}, x_{12}, x_{16}, x_{17}, x_{20}\}$, and there are no objects in the other leaf nodes.

Step 2. From the structure of T we know that $At_C$ = {Gammagt, Sgpt, Drinks, Sgot}, $At_D$ = {Gammagt, Sgpt, Drinks, Sgot}, $At_H$ = {Gammagt, Sgpt, Drinks, Alkphos}, $At_J$ = {Gammagt, Sgpt}, $At_K$ = {Gammagt, Sgpt}, and $At_L$ = {Gammagt, Sgpt, Drinks, Sgot}. The test cost of leaf node C is

$tc(X_C) = |X_C| \times tc(At_C) = |X_C| \times \sum_{a \in At_C} tc(a) = 4 \times (\$3 + \$4 + \$5 + \$8) = \$80.$

Similarly,

$tc(X_D) = |X_D| \times tc(At_D) = 3 \times (\$3 + \$4 + \$5 + \$8) = \$60,$
$tc(X_H) = |X_H| \times tc(At_H) = 1 \times (\$3 + \$4 + \$5 + \$5) = \$17,$
$tc(X_J) = |X_J| \times tc(At_J) = 2 \times (\$3 + \$4) = \$14,$
$tc(X_K) = |X_K| \times tc(At_K) = 4 \times (\$3 + \$4) = \$28,$
$tc(X_L) = |X_L| \times tc(At_L) = 6 \times (\$3 + \$4 + \$5 + \$8) = \$120.$

The total test cost is

$\sum_{x \in U'} tc(x) = \sum_{y \in LN} tc(X_y) = tc(X_A) + tc(X_B) + \cdots + tc(X_O) = \$80 + \$60 + \$17 + \$14 + \$28 + \$120 = \$319.$

Step 3. The total misclassification cost of leaf node C is $mc(C) = 1 \times mc_{(0,1)} = 1 \times \$50 = \$50$. Similarly,

$mc(D) = 1 \times mc_{(1,0)} = 1 \times \$100 = \$100,$
$mc(H) = 1 \times mc_{(1,0)} = 1 \times \$100 = \$100,$
$mc(J) = 0 \times mc_{(0,1)} = \$0,$
$mc(K) = 1 \times mc_{(1,0)} = 1 \times \$100 = \$100,$
$mc(L) = 3 \times mc_{(0,1)} = 3 \times \$50 = \$150.$

The total misclassification cost is

$\sum_{x \in U'} mc(x) = \sum_{y \in LN} mc(y) = mc(A) + mc(B) + \cdots + mc(O) = \$500.$

Step 4. The average total cost is

$ATOC(U') = \frac{\sum_{x \in U'} (tc(x) + mc(x))}{|U'|} = \frac{\$319 + \$500}{20} = \$40.95.$

6. Experiments

To test the effectiveness and efficiency of the proposed algorithm, we compare the performance of the ACSDT algorithm with the existing CS-C4.5 algorithm on fourteen standard data sets [2]. These data sets are obtained from the UCI Repository of Machine Learning Databases [2]: Biodegradability (Biodeg), Breast,


Clean, Credit, Diabetes (Diab), EEG Eye State (Eeg), German, Ionosphere (Iono), Major Atmospheric Gamma Imaging Cherenkov Telescope project (Magic), Promoters (Prom), Sonar, Spam, Wisconsin Diagnostic Breast Cancer (Wdbc), and Wisconsin Prognostic Breast Cancer (Wpbc). The information about these data sets is summarized in Table 6, where |C| is the number of condition attributes, |U| is the number of objects, and D is the name of the decision attribute. Since these data sets have no test cost settings, for statistical purposes we apply the Normal distribution to generate random test costs in [1, 10] (one way to do this is sketched below). All the selected data sets are randomly split into two subsets: 60% for training and 40% for testing. The misclassification cost is represented by a matrix defined as $mc = \begin{pmatrix} 0 & mc_{(0,1)} \\ mc_{(1,0)} & 0 \end{pmatrix}$. For simplicity, we set $mc_{(0,1)} = \$50$ and $mc_{(1,0)} = 10 \, mc_{(0,1)} = \$500$ in our experiments. At the same time, we set ∂ = 10, $\lambda$ ranges from −2 to −0.25 with a step-size of 0.25, and $\omega$ ranges from 0 to 1 with a step-size of 0.125. Let $\varphi_a$ be 1, since we suppose all tests are undertaken in parallel. In this section, we try to answer the following questions by experimentation. (1) Is the ACSDT algorithm appropriate for minimal cost decision tree construction? (2) Is the ACSDT algorithm efficient?
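As a hedged illustration of this setup, the snippet below generates Normal-distributed test costs restricted to the interval [1, 10]. The mean, standard deviation and re-drawing strategy are assumptions made for illustration, since the paper does not spell out these details.

```python
import random

def normal_test_costs(attributes, mean=5.5, std=2.0, low=1, high=10, seed=0):
    """Draw an integer test cost in [low, high] for each attribute from a
    Normal distribution; values falling outside the range are re-drawn."""
    rng = random.Random(seed)
    costs = {}
    for a in attributes:
        while True:
            c = round(rng.gauss(mean, std))
            if low <= c <= high:
                costs[a] = c
                break
    return costs

# Example: one of the 300 random test cost settings for the Liver attributes.
print(normal_test_costs(["Mcv", "Alkphos", "Sgpt", "Sgot", "Gammagt", "Drinks"]))
```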

6.1. Effectiveness of the ACSDT algorithm

For the purpose of evaluating the effectiveness of the ACSDT algorithm, we run the ACSDT algorithm and the CS-C4.5 algorithm with 300 different test cost settings on the fourteen data sets under the Normal distribution. Fig. 4 shows three average costs obtained by the CS-C4.5 algorithm and the ACSDT algorithm, where AMC represents the average misclassification cost and ATC represents the average test cost.

Fig. 4. Comparison of three average costs between CS-C4.5 algorithm and ACSDT algorithm on fourteen data sets. (Two panels plot the average cost per data set, left panel Biodeg–German and right panel Iono–Wpbc, with series AMC of CS-C4.5, AMC of ACSDT, ATC of CS-C4.5 and ATC of ACSDT.)

Fig. 5. Comparison of the run time of CS-C4.5 algorithm and ACSDT algorithm on fourteen data sets. (Two panels plot the run time in ms, on a logarithmic scale, per data set, left panel Biodeg–German and right panel Iono–Wpbc, for CS-C4.5 and ACSDT.)


Table 6
Data sets information.

Name     Domain      |C|    |U|      D
Biodeg   Chemical    41     1055     Type
Breast   Clinic      9      699      Class
Clean    Society     166    476      Class
Credit   Commerce    20     1000     Class
Diab     Clinic      8      768      Class
Eeg      Life        14     1923     Class
German   Society     24     1000     Class
Iono     Physics     34     351      Class
Magic    Physical    10     19020    Class
Prom     Game        57     106      Class
Sonar    Physical    60     208      Class
Spam     Computer    57     4601     Class
Wdbc     Clinic      30     569      Diagnosis
Wpbc     Clinic      33     198      Outcome

From the results we observe the following. (1) The ACSDT algorithm performs better than the CS-C4.5 algorithm. As can be seen from Fig. 4(a), the average total costs obtained by the ACSDT algorithm are significantly less than those of CS-C4.5 on the selected data sets, especially on the Credit, Diab and German data sets. (2) The average misclassification costs obtained by the ACSDT algorithm are smaller than those of the CS-C4.5 algorithm in most cases. (3) Only on the Iono data set are the average total costs obtained by the CS-C4.5 algorithm smaller than those of the ACSDT algorithm. On the Clean, Eeg, Wdbc and Wpbc data sets, the two algorithms have nearly the same performance, while our algorithm performs better than the CS-C4.5 algorithm on the other nine data sets. (4) On the Eeg data set, the average misclassification cost obtained by the ACSDT algorithm is bigger than that obtained by the CS-C4.5 algorithm; however, the average total cost obtained by the ACSDT algorithm is smaller than the result obtained by the CS-C4.5 algorithm. Therefore, there is a trade-off between the test cost and the misclassification cost.

Fig. 4 gives us an intuitive understanding of the results. To sum up, the performance of the ACSDT algorithm is better than that of the existing CS-C4.5 algorithm, and the results obtained by the ACSDT algorithm are generally acceptable. In the next section, we address the efficiency of the ACSDT algorithm.

6.2. Efficiency of the ACSDT algorithm

We study the efficiency of the ACSDT algorithm by comparing its run time with that of the CS-C4.5 algorithm. The run time of the two algorithms is shown in Fig. 5, which gives an intuitive understanding of the results on the fourteen data sets. From Fig. 5, we have the following observations. (1) The ACSDT algorithm is more efficient than the CS-C4.5 algorithm: the run time of the ACSDT algorithm is less than that of CS-C4.5 on all fourteen data sets selected in this experiment. (2) On larger data sets, the advantage of the new algorithm tends to be more significant. For instance, the run time of the ACSDT algorithm is 115,777 ms shorter than that of the CS-C4.5 algorithm on the Clean data set. (3) The run time of the two algorithms depends significantly on the data set size. For example, the Magic data set has the longest run time, and it also has the most objects and a moderate number of attributes among the data sets tested.

From the above experiments, we can observe the following. (1) The ACSDT algorithm is very effective: the average total costs obtained by the proposed algorithm are smaller than those of the existing CS-C4.5 algorithm in most cases, as Fig. 4 shows clearly. (2) The proposed algorithm is significantly more efficient than the CS-C4.5 algorithm: it is clear from Fig. 5 that the run time of the ACSDT algorithm is much lower than that of the existing algorithm. In a word, the performance of the ACSDT algorithm is better than that of the existing CS-C4.5 algorithm.

7. Conclusion and further study

Minimal cost decision tree construction is an important task in applications, and it is necessary to consider both the test cost and the misclassification cost in such tasks. Existing techniques work well on small data sets. However, these techniques choose an appropriate cut point of a numeric attribute by computing all possible cut points, and all attributes are tested in the process of assigning a node. As a result, there are many redundant computations in the process of decision tree construction. In this paper, we propose a cost sensitive decision tree algorithm with two adaptive mechanisms for minimal cost decision tree construction. The two adaptive mechanisms are the adaptive selecting cut point mechanism and the adaptive removing attribute mechanism. They improve the efficiency of evaluating numeric attributes for cut point selection and of assigning nodes. We performed a set of experimental evaluations and demonstrated the effectiveness and feasibility of our approach. The experimental results show that the proposed algorithm is more effective and efficient than the existing CS-C4.5 algorithm.

With regard to future research, much work remains to be undertaken. On the one hand, the principal limitation of the current implementation is that it deals only with binary class problems; in the future, the algorithm needs to be extended to cope with multi-class problems. On the other hand, a new decision tree pruning technique needs to be designed for the minimal cost decision tree construction problem.

Acknowledgments

This work is in part supported by the Key Project of Education Department of Fujian Province under Grant No. JA13192, the National Natural Science Foundation of China under Grant Nos. 61379049, 61379089 and 61170128, the Science and Technology Key Project of Fujian Province under Grant No. 2012H0043, and the Zhangzhou Municipal Natural Science Foundation under Grant No. ZZ2014J14.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.knosys.2015.08.012.

References

[1] O.F. Althuwaynee, B. Pradhan, H.J. Park, J.H. Lee, A novel ensemble decision tree-based chi-squared automatic interaction detection (CHAID) and multivariate logistic regression models in landslide susceptibility mapping, Landslides 11 (6) (2014) 1063–1078.

[2] C. Blake, C.J. Merz, UCI repository of machine learning databases [http://www.ics.uci.edu/mlearn/mlrepository.html], University of California, Department of Information and Computer Science 55, Irvine, CA.
[3] J. Bobadilla, F. Ortega, A. Hernando, G. Glez-de Rivera, A similarity metric designed to speed up, using hardware, the recommender systems k-nearest neighbors algorithm, Knowl.-Based Syst. 51 (2013) 27–34.
[4] Y.M. Chen, Q.X. Zhu, H.R. Xu, Finding rough set reducts with fish swarm algorithm, Knowl.-Based Syst. 81 (2015) 22–29.
[5] M. Claesen, D.S. Frank, J.A. Suykens, D.M. Bart, EnsembleSVM: a library for ensemble learning using support vector machines, J. Machine Learn. Res. 15 (1) (2014) 141–145.
[6] J.H. Dai, W.T. Wang, X. Qing, H.W. Tian, Uncertainty measurement for interval-valued decision systems based on extended conditional entropy, Knowl.-Based Syst. 27 (2012) 443–450.
[7] J. Davis, J. Ha, C. Rossbach, H. Ramadan, E. Witchel, Cost-sensitive decision tree learning for forensic classification, in: Machine Learning: ECML 2006, Springer, 2006, pp. 622–629.
[8] T. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Machine Learn. 40 (2) (2000) 139–157.
[9] P. Domingos, MetaCost: a general method for making classifiers cost-sensitive, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 1999.
[10] C. Elkan, The foundations of cost-sensitive learning, in: International Joint Conference on Artificial Intelligence, vol. 17, Citeseer, 2001.
[11] W. Fan, S. Stolfo, J.X. Zhang, P.K. Chan, AdaCost: misclassification cost-sensitive boosting, in: ICML, Citeseer, 1999.
[12] U. Fayyad, K. Irani, On the handling of continuous-valued attributes in decision tree generation, Machine Learn. 8 (1) (1992) 87–102.
[13] A. Freitas, A. Costa-Pereira, P. Brazdil, Cost-sensitive decision trees applied to medical data, in: Data Warehousing and Knowledge Discovery, Springer, 2007, pp. 303–312.
[14] Q. He, Z.X. Xie, Q.H. Hu, C.X. Wu, Neighborhood based sample and feature selection for SVM classification learning, Neurocomputing 74 (10) (2011) 1585–1594.
[15] T. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Machine Intell. 20 (8) (1998) 832–844.
[16] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (5) (1989) 359–366.
[17] Q.H. Hu, W. Pan, S. An, P.J. Ma, J.M. Wei, An efficient gene selection technique for cancer recognition based on neighborhood mutual information, Int. J. Machine Learn. Cybernet. 1 (1–4) (2010) 63–74.
[18] X.Y. Jia, W.H. Liao, Z.M. Tang, L. Shang, Minimum cost attribute reduction in decision-theoretic rough set models, Inform. Sci. 219 (2013) 151–167.
[19] L.X. Jiang, Z.H. Cai, D.H. Wang, H. Zhang, Improving tree augmented naive Bayes for class probability estimation, Knowl.-Based Syst. 26 (2012) 239–245.
[20] M. Khashei, M. Bijari, A novel hybridization of artificial neural networks and ARIMA models for time series forecasting, Appl. Soft Comput. 11 (2) (2011) 2664–2675.
[21] P. Kristensen, M. Judge, L. Thim, U. Ribel, K. Christjansen, B. Wulff, J. Clausen, P. Jensen, O. Madsen, N. Vrang, Hypothalamic CART is a new anorectic peptide regulated by leptin, Nature 393 (6680) (1998) 72–76.
[22] J.H. Li, C.L. Mei, C.A. Kumar, X. Zhang, On rule acquisition in decision formal contexts, Int. J. Machine Learn. Cybernet. 4 (6) (2013) 721–731.
[23] D. Liu, T.R. Li, D.C. Liang, Incorporating logistic regression to decision-theoretic rough sets for classifications, Int. J. Approx. Reason. 55 (1) (2014) 197–210.
[24] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, G.Q. Zhang, Transfer learning using computational intelligence: a survey, Knowl.-Based Syst. 80 (2015) 14–23.
[25] C. Luo, T.R. Li, H.M. Chen, D. Liu, Incremental approaches for updating approximations in set-valued ordered information systems, Knowl.-Based Syst. 50 (2013) 218–233.
[26] X.A. Ma, G.Y. Wang, H. Yu, T.R. Li, Decision region distribution preservation reduction in decision-theoretic rough set model, Inform. Sci. 278 (2014) 614–640.
[27] F. Min, H.P. He, Y.H. Qian, W. Zhu, Test-cost-sensitive attribute reduction, Inform. Sci. 181 (22) (2011) 4928–4942.
[28] F. Min, W. Zhu, Minimal cost attribute reduction through backtracking, in: Database Theory and Application, Bio-Science and Bio-Technology, Springer, 2011, pp. 100–107.
[29] F. Min, W. Zhu, A competition strategy to cost-sensitive decision trees, in: Rough Sets and Knowledge Technology, Springer, 2012, pp. 359–368.
[30] F. Min, W. Zhu, H. Zhao, COSER: cost-sensitive rough sets, 2014.
[31] S. Norton, Generating better decision trees, in: Proceedings of the 11th International Joint Conference on Artificial Intelligence, 1989, pp. 800–805.
[32] Z. Pawlak, Rough sets and intelligent data analysis, Inform. Sci. 147 (1) (2002) 1–12.
[33] J. Quinlan, Induction of decision trees, Machine Learn. 1 (1) (1986) 81–106.
[34] J. Quinlan, C4.5: Programs for Machine Learning, vol. 1, Morgan Kaufmann, 1993.
[35] Y.J. Tian, Z.Q. Qi, X.C. Ju, Y. Shi, X.H. Liu, Nonparallel support vector machines for pattern classification, IEEE Trans. Cybernet. 44 (7) (2014) 1067–1079.
[36] P.D. Turney, Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm, J. Artif. Intell. Res. 2 (1995) 369–409.
[37] P.D. Turney, Types of cost in inductive concept learning, in: Proceedings of the Workshop on Cost-Sensitive Learning at the 17th ICML, 2000.
[38] T. Wang, Z.X. Qin, S.C. Zhang, C.Q. Zhang, Cost-sensitive classification with inadequate labeled data, Inform. Syst. 37 (5) (2012) 508–516.
[39] J.M. Wei, S.Q. Wang, M.Y. Wang, J.P. You, D. Liu, Rough set based approach for inducing decision trees, Knowl.-Based Syst. 20 (8) (2007) 695–702.
[40] T. Wu, M. Hsu, Credit risk assessment and decision making by a fusion approach, Knowl.-Based Syst. 35 (2012) 102–110.
[41] X.B. Yang, Y. Qi, H.L. Yu, X.N. Song, J.Y. Yang, Updating multigranulation rough approximations with increasing of granular structures, Knowl.-Based Syst. 64 (2014) 59–69.
[42] E. Yen, I.-W.M. Chu, Relaxing instance boundaries for the search of splitting points of numerical attributes in classification trees, Inform. Sci. 177 (5) (2007) 1276–1289.
[43] H. Yu, Z.G. Liu, G.Y. Wang, An automatic method to determine the number of clusters using decision-theoretic rough set, Int. J. Approx. Reason. 55 (1) (2014) 101–115.
[44] S.C. Zhang, Decision tree classifiers sensitive to heterogeneous costs, J. Syst. Softw. 85 (4) (2012) 771–779.
[45] X.H. Zhang, J.H. Dai, Y.C. Yu, On the union and intersection operations of rough sets based on various approximation spaces, Inform. Sci. 292 (2015) 214–229.
[46] Y.D. Zhang, S.H. Wang, P. Phillips, G.L. Ji, Binary PSO with mutation operator for feature selection using decision tree applied to spam detection, Knowl.-Based Syst. 64 (2014) 22–31.
[47] H. Zhao, F. Min, W. Zhu, Cost-sensitive feature selection of numeric data with measurement errors, J. Appl. Math. 2013 (2013) 1–13.
[48] H. Zhao, W. Zhu, Optimal cost-sensitive granularization based on rough sets for variable costs, Knowl.-Based Syst. 65 (2014) 72–82.
[49] Z.H. Zhou, X.Y. Liu, On multi-class cost-sensitive learning, Comput. Intell. 26 (3) (2010) 232–257.
[50] W. Zhu, F. Wang, Reduction and axiomization of covering generalized rough sets, Inform. Sci. 152 (2003) 217–230.