Pattern Recognition Letters 27 (2006) 892–899 www.elsevier.com/locate/patrec
Learning probabilistic decision trees for AUC

Harry Zhang, Jiang Su

Faculty of Computer Science, University of New Brunswick, P.O. Box 4400, Fredericton, NB, Canada E3B 5A3

Available online 13 December 2005
Abstract

Accurate ranking, measured by AUC (the area under the ROC curve), is crucial in many real-world applications. Most traditional learning algorithms, however, aim only at high classification accuracy. It has been observed that traditional decision trees produce good classification accuracy but poor probability estimates. Since the ranking generated by a decision tree is based on the class probabilities, a probability estimation tree (PET) with accurate probability estimates is desired in order to yield high AUC. Some researchers ascribe the poor probability estimates of decision trees to the decision tree learning algorithms. To our observation, however, the representation also plays an important role. In this paper, we propose to extend decision trees to represent a joint distribution and conditional independence, called conditional independence trees (CITrees), which is a more suitable model for yielding high AUC. We propose a novel AUC-based algorithm for learning CITrees, and our experiments show that the CITree algorithm outperforms the state-of-the-art decision tree learning algorithm C4.4 (a variant of C4.5), naive Bayes, and NBTree in AUC. Our work provides an effective model and algorithm for applications in which an accurate ranking is required.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Decision trees; AUC; Naive Bayes; Ranking
1. Introduction

Classification is one of the most important tasks in machine learning and pattern recognition. In classification, a classifier is built from a set of training examples with class labels. A key performance measure of a classifier is its predictive accuracy (or error rate, 1 - accuracy). Many classifiers can also produce class probability estimates p(c|E), the probability that an example E belongs to class c. However, this information is largely ignored: the error rate does not consider how "far off" (be it 0.45 or 0.01) the prediction of each example is from its target, but only the class with the largest probability estimate.

In many applications, however, classification and error rate are not enough. For example, in direct marketing, we often need to promote the top X% of customers during gradual roll-out,
or we often deploy different promotion strategies to customers with different likelihoods of buying some products. To accomplish these tasks, we need more than a mere classification of buyers and non-buyers. We need (at least) a ranking of customers in terms of their likelihoods of buying. Thus, a ranking is much more desirable than just a classification.

If we are aiming at an accurate ranking from a classifier, one might naturally think that we need the true ranking of the training examples. In most scenarios, however, that is not possible. Most likely, what we are given is a data set of examples with class labels. Fortunately, when only a training set with class labels is given, the area under the ROC (receiver operating characteristics) curve (Swets, 1988; Provost and Fawcett, 1997), or simply AUC, can be used to evaluate classifiers that also produce rankings. Hand and Till (2001) show that, for binary classification, AUC is equivalent to the probability that a randomly chosen example of the negative class will have a smaller estimated probability of belonging to the positive class than a randomly chosen example of the positive class. They present a simple approach to calculating the AUC of a classifier, given below:
\hat{A} = \frac{S_0 - n_0(n_0 + 1)/2}{n_0 n_1}    (1)
where n0 and n1 are the numbers of negative and positive examples, respectively, and S0 = Σ ri, where ri is the rank of the ith positive example in the ranking. From Eq. (1), it is clear that AUC is essentially a measure of the quality of a ranking. For example, the AUC of a ranking is 1 (the maximum value of AUC) if there is no positive example preceding a negative example.
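The pairwise interpretation above, that a randomly chosen positive example should receive a higher estimated probability of the positive class than a randomly chosen negative one, can be computed directly; Eq. (1) is a rank-based reformulation of the same quantity. The following minimal Python sketch (our own illustration, not code from the paper; the function name is hypothetical) computes AUC from the pairwise definition:

```python
def binary_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive example gets a higher
    estimated probability of the positive class than a random negative one;
    ties count as 1/2.  This is the quantity the rank-based formula in Eq. (1)
    expresses, written here as an O(n0 * n1) pairwise count for clarity."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# A ranking that places every positive above every negative has AUC 1:
print(binary_auc([0.9, 0.8, 0.7], [0.4, 0.2]))   # -> 1.0
```

In practice one would prefer the rank-based form of Eq. (1), which only requires sorting the examples once.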
If we are aiming at accurate probability-based ranking, what is the performance of traditional learning algorithms, such as decision trees and naive Bayes? While decision trees perform quite well in classification, it has also been found that their probability estimates are poor (Pazzani et al., 1994; Provost et al., 1998). Building decision trees with accurate probability estimates, called probability estimation trees (PETs), has received a great deal of attention recently (Provost and Domingos, 2003). Some researchers ascribe the poor probability estimates of decision trees to the decision tree learning algorithms, and many techniques have thus been proposed to improve the learning algorithms in producing accurate probability estimates (Provost and Domingos, 2003). To our observation, however, the representation also plays an important role. Indeed, the representation of decision trees is fully expressive theoretically, but it is often impractical to learn such a representation with accurate probability estimates from limited training data.

In a decision tree, the class probability p(c|E) is estimated by the fraction of the examples of class c in the leaf into which E falls. Thus, the class probabilities of all the examples in the same leaf are equal. This is an obstacle in building an accurate PET, because two contradictory factors are in play at the same time. On one hand, traditional decision tree algorithms, such as C4.5, prefer a small tree; then each leaf has more examples and the class probability estimates are more reliable. A small tree, however, has a small number of leaves, so more examples share the same class probability. That prevents the learning algorithm from building an accurate PET. On the other hand, if the tree is large, not only may it overfit the training data, but the number of examples in each leaf is also small, and thus the probability estimates would not be accurate and reliable. Such a contradiction does exist in traditional decision trees.

Our motivation is to build a model that produces accurate ranking by extending the representation of traditional decision trees, not only to represent accurate probabilities but also to be easily learnable from limited data in practice. Naturally, if an accurate PET is built, the ranking yielded by it should also be accurate, since an accurate approximation of p(c|E) is found and can be used for ranking. In other words, its AUC should be high.

In this paper, a training example is represented by a vector of attribute values and a class label. We denote a vector of attributes by a bold-face upper-case letter A, A = (A1, A2, . . . , An), and an assignment of a value to each attribute in A by a corresponding bold-face lower-case letter a. We use C to denote the class variable and c to denote its value. Thus, a training example is E = (a, c), where a = (a1, a2, . . . , an) and ai is the value of attribute Ai. A classifier is a function that maps an example to a class label.

The rest of the paper is organized as follows. Section 2 introduces the related work on learning decision trees with accurate probability estimates and ranking. Section 3 presents a novel model for ranking and a corresponding algorithm. In Section 4, we present empirical experiments. The paper concludes with discussion and some directions for future work.

2. Related work

Traditional decision tree algorithms, such as C4.5, have been observed to produce poor estimates of probabilities (Pazzani et al., 1994; Provost et al., 1998). According to Provost and Domingos (2003), the decision tree representation is not (inherently) doomed to produce poor probability estimates; part of the problem is that modern decision tree algorithms are biased against building trees with accurate probability estimates. Provost and Domingos propose the following techniques to improve the AUC of C4.5.

(1) Smooth the probability estimates by the Laplace correction. Assume that there are p examples of a class at a leaf, N total examples, and C total classes. The frequency-based estimation calculates the estimated probability as p/N, whereas the Laplace estimation calculates it as (p + 1)/(N + C) (see the sketch below).

(2) Turn off pruning and collapsing. Provost and Domingos (2003) show that pruning a large tree damages the probability estimation. Thus, a simple strategy to improve the probability estimation is to build a large tree without pruning.

Provost and Domingos call the resulting algorithm C4.4. They compared C4.4 to C4.5 in empirical experiments, and found that C4.4 is a significant improvement over C4.5 with regard to AUC.

Ling and Yan (2003) propose a method to improve the AUC of a decision tree. They present a novel probability estimation algorithm, in which the class probability of an example is an average of the probability estimates from all leaves of the tree, instead of only the leaf into which it falls. In other words, each leaf contributes to the class probability estimate of an example. Ferri et al. (2003) propose a new probability smoothing technique for decision trees, m-branch smoothing, in which the class distributions of all nodes from the root to each leaf are taken into account.
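To make the two estimates in item (1) concrete, here is a small Python illustration (our own sketch; the function names are not from the paper) of the frequency-based estimate p/N and the Laplace-corrected estimate (p + 1)/(N + C) at a leaf:

```python
def frequency_estimate(p, n):
    """Frequency-based estimate: the fraction p/N of leaf examples in the class."""
    return p / n

def laplace_estimate(p, n, c):
    """Laplace-corrected estimate (p + 1) / (N + C) for a C-class problem."""
    return (p + 1) / (n + c)

# A pure leaf with 3 examples, all of one class, in a two-class problem:
print(frequency_estimate(3, 3))   # 1.0  -- overconfident for a tiny leaf
print(laplace_estimate(3, 3, 2))  # 0.8  -- pulled toward 1/C, never exactly 0 or 1
```

The correction matters for ranking because an unpruned tree has many small leaves whose raw frequencies are 0 or 1, producing many tied, extreme probability estimates.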
In learning a decision tree, a critical step is to choose the "best" attribute at each step. Entropy-based splitting criteria, such as information gain and gain ratio, have been widely used. Recently, Ferri et al. (2002) propose a novel splitting criterion based on the ROC curve. Their experiments show that the new algorithm results in better probability estimates, without sacrificing accuracy.

A questionable point of traditional decision trees (including probabilistic trees) is that only the attributes along the path from the root to a leaf are used in both classification and probability estimation. Since a small tree is preferred by traditional decision tree learning algorithms, many attributes may not be used. This is a more serious issue in ranking than in classification. Kohavi (1996) proposes to deploy a naive Bayes in each leaf, and the resulting decision tree is called an NBTree. The algorithm for learning an NBTree is similar to C4.5. After a tree is grown, a naive Bayes is constructed for each leaf using the data associated with that leaf. An NBTree classifies an example by sorting it to a leaf and applying the naive Bayes in that leaf to assign a class label to it.

Actually, deploying a model at the leaves to calibrate the probability estimates of a decision tree has been proposed by Smyth et al. (1996). They also notice that every example from a particular leaf has the same probability estimate, and thus suggest placing a kernel-based probability density estimator at each leaf.

Our work is inspired by the work of Kohavi and of Smyth et al., but from a different point of view. Indeed, if a local model that incorporates the attributes not occurring on the path is deployed at each leaf, together with the conditional probability of the attributes occurring on the path, the resulting tree represents accurate probabilities. If the structure of standard decision trees is learned and used in the same way as in C4.5, however, the leaf models would not directly and explicitly benefit from the structure, and thus would still only play a smoothing role. Our motivation is to learn and use the structure of a tree to explore conditional independencies among attributes, such that a simple leaf model, like naive Bayes, gives accurate probability estimates. The resulting model is then more compact and more easily learnable, while its representation is still accurate. Since the probability estimates are more accurate, the ranking yielded by the model is also more accurate.
3. Understanding decision trees from a probabilistic perspective

Even though there theoretically exists a decision tree with accurate probability estimates for any given problem, such a tree tends to be large and is learnable only when sufficient (huge) training data are available. This issue is called the fragmentation problem (Pagallo and Haussler, 1990; Kohavi, 1996). In practice, a small tree is preferred, and thus poor probability estimates are yielded. Therefore, the representation of a decision tree should be extended to represent accurate probabilities and be learnable from limited training data.

3.1. Probabilistic decision trees

Fig. 1 shows an example of a probabilistic tree (Buntine, 1991), in which each leaf L represents a conditional distribution p(C|Ap(L)), where Ap(L) are the attributes that occur on the path from the root to L. For simplicity, the attributes that occur on the path are called the path attributes of L, and all other attributes are called the leaf attributes of L, denoted by Al(L). In practice, p(C|Ap(L)) is often estimated by the fraction of examples of class C in L, and the classification of a decision tree is based on p(C|Ap(L)). Thus, from the probabilistic point of view, a decision tree defines a classifier, shown as

C_{dt}(E) = \arg\max_c \, p(c \mid a_p(L)),    (2)

where L is the leaf into which E falls, ap(L) is the value of the path attributes of L, and Cdt(E) is the classification given by the decision tree. In a decision tree, p(c|ap(L)) is actually used as an approximation of p(c|E). Thus, all the examples falling into the same leaf have the same class probability.

Fig. 1. An example of a probabilistic tree.

3.2. Conditional independence trees

In a probabilistic tree, a leaf L represents the conditional probability distribution p(C|Ap(L)). If there is a representation of the conditional probability distribution over the leaf attributes at each leaf, called the local conditional distribution and denoted by p(Al(L)|Ap(L), C), then each leaf represents a full joint distribution over all the attributes, shown as

p(\mathbf{A}, C) = \alpha \, p(C \mid A_p(L)) \, p(A_l(L) \mid A_p(L), C),    (3)

where \alpha is a normalization factor. A probabilistic decision tree T is called a joint probabilistic tree if each of its leaves represents both the conditional probability distribution p(C|Ap(L)) and p(Al(L)|Ap(L), C). A joint probabilistic tree T is called a conditional independence tree, or simply CITree, if the local conditional independence assumption, shown in Eq. (4), holds for each leaf L:
p(A_l(L) \mid A_p(L), C) = \prod_{i=1}^{m} p(A_{li} \mid C, A_p(L)),    (4)
where Al(L) = (Al1, Al2, . . . , Alm) are the leaf attributes of L.

Given an example E, the class probability p(c|E) is computed as follows. E is sorted from the root to a leaf using its attribute values (the path attributes), and then the local model at that leaf is applied to compute p(c|E) using only the leaf attributes. The structure of a CITree represents the conditional independencies among attributes, and its leaves represent a joint distribution. A CITree differs from a probabilistic tree in the following aspects:

(1) A CITree represents a joint distribution over all the attributes, whereas a probabilistic tree represents only the conditional probability distribution of the path attributes.
(2) A CITree explicitly defines conditional dependencies among attributes.

Notice the conditional independence assumption on which naive Bayes is based, shown as

p(\mathbf{a} \mid c) = \prod_{i=1}^{n} p(a_i \mid c).    (5)
Comparing Eq. (4) with Eq. (5), we notice that the local conditional independence assumption of CITrees is a relaxation of the (global) conditional independence assumption of naive Bayes. Thus, the local conditional independence assumption is more realistic in applications. In addition, the local conditional independence represented in a CITree is also different from the conditional independence in a Bayesian network. In a Bayesian network, saying that an attribute A1 is conditionally independent of an attribute A2 given A3 means that A1 is independent of A2 for all values of A3. In a CITree, however, the conditional independence is that A1 is independent of A2 given a specified value of A3. The granularity in a CITree is therefore finer than that in a Bayesian network.

It is interesting to notice that, after growing a CITree, if a naive Bayes is deployed at each leaf using only the data associated with it, that naive Bayes, called a leaf naive Bayes, represents the actual joint distribution. A leaf naive Bayes in leaf L is shown as

C_{lnb}(E) = \arg\max_c \, p_L(c) \prod_{i=1}^{m} p_L(a_{li} \mid c),    (6)

where pL(c) denotes the probability that examples in L are in class c, and pL(ali|c) is the probability that the examples of class c in L have Ali = ali. It is obvious that pL(c) = p(c|ap(L)) and pL(ali|c) = p(ali|c, ap(L)) on the whole training data. So pL(c) Π pL(ali|c) is proportional to p(c|E). Thus, if the structure of the CITree is found, naive Bayes is a perfect model for the leaves. Generally, if the local model is naive Bayes, a CITree can be viewed as a combination of a decision tree and naive Bayes.
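As an illustration of how a CITree with leaf naive Bayes models scores an example (the data structures and names below are our own simplifications, not the authors' implementation), an example is first routed to a leaf by its path-attribute values, and Eq. (6) is then evaluated there over the leaf attributes only:

```python
class LeafNB:
    """Leaf naive Bayes: class priors p_L(c) and per-attribute conditionals
    p_L(a | c), both estimated from the training examples reaching the leaf."""
    def __init__(self, priors, cond):
        self.priors = priors      # {class: p_L(c)}
        self.cond = cond          # {class: {leaf attribute: {value: p_L(value | c)}}}

    def class_scores(self, example):
        scores = {}
        for c, prior in self.priors.items():
            score = prior
            for attr, table in self.cond[c].items():   # leaf attributes only
                score *= table[example[attr]]
            scores[c] = score
        return scores             # proportional to p(c | E), as in Eq. (6)


class CITreeNode:
    """Internal nodes split on one path attribute; leaves hold a LeafNB."""
    def __init__(self, attr=None, children=None, leaf_model=None):
        self.attr, self.children, self.leaf_model = attr, children, leaf_model


def predict_proba(node, example):
    # Sort the example down the tree on its path-attribute values ...
    while node.leaf_model is None:
        node = node.children[example[node.attr]]
    # ... then apply the leaf naive Bayes and normalize.
    scores = node.leaf_model.class_scores(example)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}
```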
It is well known that decision trees are fully expressive within the class of propositional languages; that is, any Boolean function is representable by a decision tree. Naive Bayes, however, has limited expressive power; it can only represent linear Boolean functions (Domingos and Pazzani, 1997). Interestingly, any joint distribution is representable by a CITree. According to the product rule,

p(A_1, A_2, \ldots, A_n, C) = p(C)\, p(A_1 \mid C)\, p(A_2 \mid A_1, C) \cdots p(A_n \mid A_1, \ldots, A_{n-1}, C).    (7)
A CITree representing any joint distribution p(A1, A2, . . . , An) is shown in Fig. 2. Thus, CITrees are also fully expressive. The representation of CITrees, however, is more compact than that of decision trees. To show this, let us consider only full dependencies among attributes. An attribute Ai is said to fully depend on Aj if Ai = Aj. Notice that if an attribute is conditionally independent of all other attributes, it does not occur on any path. If several attributes conditionally depend on one attribute, only that attribute occurs on the path. In the extreme case that the global conditional independence assumption is true, a CITree has only one node, which is just a global naive Bayes. Assume that there are n attributes. The maximum height of a CITree is n/2, which corresponds to the case in which each attribute depends on exactly one other attribute. The maximum height of a decision tree is n. Our experiments in Section 4 show that the average size of CITrees is much smaller than that of decision trees.

Fig. 2. A CITree to represent any joint distribution p(A1, A2, . . . , An), where A1, A2, . . . , An are Boolean attributes.

3.3. An AUC-based algorithm for learning CITrees

From the discussion in the preceding section, a CITree can represent any joint distribution. Thus, a CITree is a perfect PET, and the ranking yielded by a CITree is accurate. In practice, however, learning the structure of a CITree is just as time-consuming as learning an optimal decision tree. A good approximation of a CITree, which gives good estimates of class probabilities, is sufficient in many applications. If the structure of a CITree is determined, a leaf naive Bayes is a perfect model for representing the local conditional distributions at the leaves.

Building a CITree can also be a greedy and recursive process, similar to building a decision tree.
At each step, the algorithm chooses the "best" attribute as the root of the (sub)tree, splits the associated data into disjoint subsets corresponding to the values of that attribute, and then recurses on each subset until certain stopping criteria are satisfied. Notice, however, the difference between learning a CITree and learning a decision tree. In building a decision tree, we are looking for a sequence of attributes that leads to the least impurity in all leaves of the tree. The key in choosing an attribute is whether the resulting partition of the examples is "pure" or not. This is natural, since the most common class of a leaf is used as the class of all the examples in that leaf. Such a selection strategy, however, does not necessarily lead to the truth of the local conditional independence assumption. In building a CITree, we intend to choose the attributes that make the local conditional independence among the remaining attributes hold as much as possible. This means that, even if the impurity of its leaves is high, a tree could still be a good CITree, as long as the leaf attributes are independent. Thus, traditional decision tree learning algorithms are not directly suitable for learning CITrees.

In learning a CITree, an attribute given which all other attributes have the maximum conditional independence should be selected at each step. Thus, we should select the attribute with the greatest influence on the other attributes. Our idea is to try each possible attribute as the root, evaluate the resulting tree, and choose the attribute that achieves the highest AUC. Similar to C4.5, our learning algorithm has two steps: growing a tree and pruning. In growing a tree, each possible attribute is evaluated at each step, and the attribute that gives the most improvement in AUC is selected. The algorithm is depicted below.

Algorithm AUC-CITree(T, S, A)
Input: CITree T, a set S of labeled examples, a set of attributes A.
Output: a CITree.
(1) For each attribute A in A:
    • Partition S into S1, . . . , Sk, each of which corresponds to a value of A.
    • Create a leaf naive Bayes for each Si.
    • Evaluate the AUC on S of the resulting CITree.
(2) If the best AUC of the resulting CITrees is not significantly better than the one produced from the naive Bayes on S, make the current node a leaf and return.
(3) For each value a of Aopt, the attribute that achieves the most improvement in AUC:
    • Call AUC-CITree(Ta, Sa, A \ {Aopt}).
    • Add Ta as a child of T.
(4) For each child A of the parent Ap of Aopt:
    • Tentatively make the node A a leaf and evaluate the resulting AUC on Ap.
    • If it is not significantly worse than the original AUC, keep A as a leaf.
(5) Return T.
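As a rough Python sketch of the growing step only (post-pruning, step (4), is omitted; train_leaf_nb and evaluate_auc are assumed helpers standing for fitting a leaf naive Bayes on a subset and computing the AUC of a candidate partition on S, and CITreeNode is reused from the earlier sketch), the attribute selection could look like this:

```python
def grow_citree(examples, attributes, train_leaf_nb, evaluate_auc, rel_threshold=0.05):
    """Greedy, AUC-driven growth of one (sub)tree in the spirit of AUC-CITree.
    A split is kept only if its relative AUC improvement over a single naive
    Bayes at this node exceeds rel_threshold (5% by default)."""
    base_model = train_leaf_nb(examples)
    base_auc = evaluate_auc({None: base_model}, examples)

    best_attr, best_auc, best_partition = None, base_auc, None
    for attr in attributes:
        # Partition the data on attr and fit a leaf naive Bayes on each subset.
        subsets = {}
        for e in examples:
            subsets.setdefault(e[attr], []).append(e)
        partition = {v: train_leaf_nb(sub) for v, sub in subsets.items()}
        auc = evaluate_auc(partition, examples)
        if auc > best_auc:
            best_attr, best_auc, best_partition = attr, auc, partition

    # Stop if no split improves AUC by more than the relative threshold.
    if best_attr is None or (best_auc - base_auc) / best_auc < rel_threshold:
        return CITreeNode(leaf_model=base_model)

    # Otherwise recurse on each subset, with the chosen attribute removed.
    children = {}
    for value in best_partition:
        child_examples = [e for e in examples if e[best_attr] == value]
        children[value] = grow_citree(child_examples,
                                      [a for a in attributes if a != best_attr],
                                      train_leaf_nb, evaluate_auc, rel_threshold)
    return CITreeNode(attr=best_attr, children=children)
```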
In the AUC-CITree algorithm, we use a relative AUC increase (or reduction) (AUC_c - AUC_o)/AUC_c of 5% to define significance for the improvement of the resulting CITree, where AUC_c and AUC_o are the new AUC score and the original one, respectively. Notice that the AUC for the children of a node is computed by putting the instances from all the leaves together, rather than computing the AUC for each leaf separately.

Our AUC-CITree algorithm is different from the NBTree algorithm (Kohavi, 1996) in several aspects:
(1) Our AUC-CITree algorithm is based on AUC, instead of accuracy.
(2) The AUC-CITree algorithm adopts a post-pruning strategy, rather than early stopping.

It has been noticed that pruning is detrimental to probability estimation for traditional decision trees (Provost and Domingos, 2003; Ferri et al., 2003). However, the situation is different for CITrees. Notice that a local model is deployed at each leaf. Without pruning, the probability estimates given by the local models would not be reliable, because the number of training examples at each leaf is small. Thus, pruning is necessary for building a good CITree.
Table 1
Description of the data sets used in the experiments

Data set                    Size    Number of attributes  Number of classes
Letter                      20,000  17                    26
Mushroom                    8124    22                    2
Waveform                    5000    41                    3
Sick                        3772    30                    2
Hypothyroid                 3772    30                    4
Chess end-game              3196    36                    2
Splice                      3190    62                    3
Segment                     2310    20                    7
German credit               1000    24                    2
Vowel                       990     14                    11
Anneal                      898     39                    6
Vehicle                     846     19                    4
Pima Indians diabetes       768     8                     2
Wisconsin-breast-cancer     699     9                     2
Credit approval             690     15                    2
Soybean                     683     36                    19
Balance-scale               625     5                     3
Vote                        435     16                    2
Horse colic                 368     28                    2
Ionosphere                  351     34                    2
Primary-tumor               339     18                    22
Heart-c                     303     14                    5
Breast cancer               286     9                     2
Heart-statlog               270     13                    2
Audiology                   226     70                    24
Glass                       214     10                    7
Sonar                       208     61                    2
Autos                       205     26                    7
Hepatitis domain            155     19                    2
Iris                        150     5                     3
Lymph                       148     19                    4
Zoo                         101     18                    7
Labor                       57      16                    2
An alternative strategy is early stopping, adopted in NBTree: the tree-growing process stops when the size of the training data is smaller than a threshold (30 in NBTree). From our experiments, however, pruning is more effective.

Notice that both Ling and Yan (2003) and Ferri et al. (2003) are essentially smoothing techniques based on the structure of traditional decision trees. CITrees, however, use the structure of a decision tree to represent conditional dependence and deploy a local model at each leaf to produce the class probabilities. Intuitively, CITrees could be more powerful than smoothing techniques.

4. Experiments
We conduct experiments to compare our algorithm CITree with C4.4, naive Bayes, and NBTree. The implementations of C4.4, naive Bayes, and NBTree are from Weka (Witten and Frank, 2000); C4.4 is J48 in Weka with the Laplace correction and with pruning and collapsing turned off. Notice that C4.4 is designed specifically for improving the AUC score of decision trees, whereas the naive Bayes used in our experiments is not.

We used 33 UCI (Merz et al., 1997) data sets assigned by Weka, described in Table 1. Numeric attributes are discretized using ten-bin discretization implemented in Weka. Missing values are also processed using the mechanism in Weka. In our experiments, multi-class AUC is calculated by the M-measure (Hand and Till, 2001), and the average AUC on each data set is obtained using 10-fold stratified cross validation repeated 10 times. In our implementation, we used the Laplace estimation to avoid the zero-frequency problem. We conducted a two-tailed t-test with a 95% confidence level to compare each pair of algorithms on each data set.
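For reference, the M-measure of Hand and Till (2001) averages pairwise AUC values over all unordered pairs of classes. The rough sketch below (our own, reusing binary_auc from the earlier sketch, with hypothetical argument names) follows that definition, where probs[k][c] is the estimated probability that example k belongs to class c:

```python
from itertools import combinations

def m_measure(probs, labels, classes):
    """Multi-class AUC (M-measure): the average over all unordered class pairs
    (i, j) of A(i, j) = (A(i|j) + A(j|i)) / 2, where A(i|j) ranks the examples
    of classes i and j by their estimated probability of class i."""
    total = 0.0
    for i, j in combinations(classes, 2):
        idx_i = [k for k, y in enumerate(labels) if y == i]
        idx_j = [k for k, y in enumerate(labels) if y == j]
        a_i_given_j = binary_auc([probs[k][i] for k in idx_i],
                                 [probs[k][i] for k in idx_j])
        a_j_given_i = binary_auc([probs[k][j] for k in idx_j],
                                 [probs[k][j] for k in idx_i])
        total += (a_i_given_j + a_j_given_i) / 2.0
    return total / (len(classes) * (len(classes) - 1) / 2.0)
```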
Table 2 shows the average AUC obtained by the four algorithms. Our observations are summarized below.
Table 2
Experimental results on AUC

Data set                    AUC-CITree      NBTree          C4.4            NB
Letter                      98.59 ± 0.15    98.49 ± 0.17    95.26 ± 0.32    96.88 ± 0.21
Mushroom                    100 ± 0         100 ± 0         100 ± 0         99.79 ± 0.07
Waveform                    95.29 ± 0.68    93.69 ± 0.96    80.95 ± 1.47    95.29 ± 0.68
Sick                        98.42 ± 1.43    94.27 ± 3.62    99.08 ± 0.52    95.83 ± 2.4
Hypothyroid                 88.26 ± 5.67    87.47 ± 6.34    82.74 ± 7.58    87.78 ± 6.12
Chess end-game              99.79 ± 0.24    99.44 ± 0.6     99.95 ± 0.06    95.16 ± 1.2
Splice                      99.45 ± 0.28    99.43 ± 0.31    97.91 ± 0.72    99.45 ± 0.28
Segment                     99.33 ± 0.23    99.08 ± 0.34    98.98 ± 0.38    98.5 ± 0.41
German credit               79.03 ± 4.2     77.49 ± 5.34    68.58 ± 4.67    79.02 ± 4.22
Vowel                       99.22 ± 0.62    98.46 ± 0.84    90.57 ± 2.33    95.58 ± 1.12
Anneal                      96.05 ± 2.03    96.23 ± 1.29    94.53 ± 2.31    96.1 ± 1.19
Vehicle                     86.18 ± 2.71    85.66 ± 3.43    85.35 ± 3.07    80.31 ± 3.09
Pima Indians diabetes       82.47 ± 5.03    81.99 ± 5.1     73.89 ± 5.33    82.51 ± 5
Wisconsin-breast-cancer     99.15 ± 0.85    99.25 ± 0.75    97.98 ± 1.44    99.25 ± 0.75
Credit approval             91.04 ± 3.19    91.15 ± 3.44    87.5 ± 3.75     91.67 ± 3.17
Soybean                     99.75 ± 0.33    99.68 ± 0.41    91.32 ± 1.58    99.73 ± 0.34
Balance-scale               84.08 ± 4.42    84.08 ± 4.42    58.83 ± 5.31    84.08 ± 4.42
Vote                        98.26 ± 1.73    98.51 ± 1.67    97.28 ± 2.53    96.95 ± 2.14
Horse colic                 84.59 ± 7.15    86.28 ± 6.91    81.91 ± 7.32    83.32 ± 7.57
Ionosphere                  95.33 ± 3.5     94.04 ± 4.42    92.09 ± 5.2     93.4 ± 4.79
Primary-tumor               78.75 ± 1.72    78.12 ± 1.8     74.9 ± 2.37     78.88 ± 1.76
Heart-c                     84.05 ± 0.6     83.93 ± 0.62    83.11 ± 0.83    84.05 ± 0.6
Breast cancer               66.94 ± 11.36   66.01 ± 10.94   58.05 ± 9.93    68.24 ± 11.93
Heart-statlog               90.78 ± 5.1     89.28 ± 6.26    80.82 ± 9.39    90.85 ± 5.12
Audiology                   70.8 ± 0.86     71.06 ± 0.69    70.51 ± 0.72    71.08 ± 0.64
Glass                       84.27 ± 5.15    82 ± 6.08       82.72 ± 5.24    80.89 ± 5.9
Sonar                       78.94 ± 10.44   77.54 ± 9.9     76.32 ± 9.05    84.17 ± 9.52
Autos                       93.36 ± 2.97    93.84 ± 3.15    91.21 ± 3.33    89.84 ± 5.09
Hepatitis domain            85.65 ± 13.17   82.77 ± 13.96   75.6 ± 16.57    87.25 ± 11.93
Iris                        98.42 ± 2.13    98.84 ± 2.01    96.86 ± 2.86    98.64 ± 2.17
Lymph                       89.92 ± 1.82    89.05 ± 2.47    86.33 ± 4.84    90.01 ± 1.71
Zoo                         89.44 ± 2.44    89.48 ± 2.37    88.43 ± 2.7     89.48 ± 2.37
Labor                       95.5 ± 15.55    97.42 ± 12.06   82.21 ± 20.45   97.5 ± 8.58

Mean                        90.31 ± 3.58    89.82 ± 3.72    85.51 ± 4.37    89.74 ± 3.53
(1) The CITree algorithm outperforms naive Bayes significantly in terms of AUC: it wins on 9 data sets, ties on 24 data sets, and loses on none. The average AUC for CITree is 90.31%, higher than the average AUC of 89.74% for naive Bayes.
(2) The CITree algorithm also outperforms C4.4 significantly in terms of AUC: it wins on 19 data sets, ties on 14 data sets, and loses on none. The average AUC for decision trees is 85.81%, lower than CITree's.
(3) The CITree algorithm performs better than NBTree in terms of AUC: it wins on 4 data sets, ties on 29 data sets, and loses on none. The average AUC for NBTree is 89.82%, lower than CITree's.

Table 3 shows the tree size and training time obtained by the three tree learning algorithms. Notice that C4.4 is much more efficient than both NBTree and CITree; thus, we do not include the running time of C4.4. From Table 3, we can see that the CITree learning algorithm is more efficient than NBTree and that the size of the CITrees is smaller than that of NBTree and C4.4. Some detailed observations are summarized below (Table 4).
Table 4
Results of two-tailed t-test on AUC

              NB        C4.4      NBTree
AUC-CITree    9-24-0    19-14-0   4-29-0
NB                      22-6-5    2-23-8
C4.4                              1-13-19

Note: Each entry w/t/l means that the algorithm in the corresponding row wins in w data sets, ties in t data sets, and loses in l data sets, compared to the algorithm in the corresponding column.
(1) The tree size for CITree is significantly smaller than the tree size for C4.4 over most of these data sets. Here the size of a tree is the number of nodes. The total tree size for CITree is 477, whereas for C4.4 it is 28,356. Notice that pruning is not suitable for C4.4: the basic idea of C4.4 is to obtain a large tree and then use the Laplace correction to smooth the probability estimates. According to Provost and Domingos (2003), pruning damages the probability estimation of traditional decision trees.
(2) The tree size for CITree is also significantly smaller than the tree size for NBTree over most of these data sets. The total tree size for NBTree is 2158. NBTree avoids producing a large tree by early stopping instead of pruning, but it essentially prefers a small tree.
Table 3
Experimental results on the tree size and training time (s)

Data set                    CITree(S)  NBTree(S)  C4.4(S)  CITree(T)  NBTree(T)
Letter                      18         1298       14,162   52.63      246.11
Mushroom                    20         26         30       5.01       6.73
Waveform                    1          57         4161     7.36       64.26
Sick                        58         63         359      7          16.26
Hypothyroid                 10         5          1463     7          6.29
Chess end-game              56         47         88       12.73      20.71
Splice                      1          3          588      9.61       13.49
Segment                     12         121        759      2.73       12.35
German credit               1          16         800      0.34       2.53
Vowel                       22         64         899      0.99       8.4
Anneal                      11         43         100      2.3        9.52
Vehicle                     94         123        937      0.88       9.57
Pima Indians diabetes       1          8          689      0.06       0.55
Wisconsin-breast-cancer     2          2          149      0.07       0.56
Credit approval             8          14         429      0.27       2.21
Soybean                     4          37         131      4.44       24.04
Balance-scale               1          1          308      0.02       0.1
Vote                        18         17         47       0.36       1.67
Horse colic                 29         28         210      0.45       5.42
Ionosphere                  28         13         164      0.63       7.69
Primary-tumor               2          11         196      0.53       2.49
Heart-c                     1          11         203      0.08       1.3
Breast cancer               7          11         230      0.05       0.52
Heart-statlog               1          13         271      0.05       0.83
Audiology                   9          23         93       5.21       28.88
Glass                       12         29         276      0.09       0.81
Sonar                       11         13         154      0.78       15.56
Autos                       15         25         216      0.37       4.12
Hepatitis domain            6          10         76       0.12       1.74
Iris                        7          7          55       0.01       0.15
Lymph                       2          9          73       0.08       1.42
Zoo                         6          7          23       0.11       0.92
Labor                       3          4          21       0.02       0.51

Total                       477        2158       28,356   122.38     517.71

Note: CITree(S), NBTree(S), and C4.4(S) represent the tree size obtained by the corresponding algorithm, and CITree(T) and NBTree(T) represent the corresponding training time.
(3) The training time for CITree is significantly shorter than that for NBTree over most of these data sets. The total training time for CITree is 123 s, versus 517 s for NBTree.
5. Conclusions

In this paper, we extend the traditional decision tree model to represent accurate probabilities in order to yield accurate ranking, or high AUC. We propose a model, CITree, whose structure explicitly represents conditional independencies among attributes. CITrees are more expressive than naive Bayes and more compact than decision trees. We present and implement a novel AUC-based learning algorithm, AUC-CITree, which builds a CITree for ranking by exploring the conditional independencies among attributes, differently from traditional decision tree learning algorithms. Our experiments show that the AUC-CITree algorithm performs better than C4.4, naive Bayes, and NBTree in AUC. In addition, the AUC-CITree algorithm is more efficient and produces smaller trees than NBTree.

A CITree can be viewed as a bridge between probabilistic models, such as Bayesian networks, and non-parametric models, such as decision trees. However, a more effective CITree learning algorithm is desired. Currently, our learning algorithm is based on cross-validation. We believe that if a better learning algorithm is found, a CITree will benefit much from its structure, and thus will be a good model for applications.

References

Buntine, W., 1991. Learning classification trees. In: Artificial Intelligence Frontiers in Statistics. Chapman and Hall, London, pp. 182-201.
Domingos, P., Pazzani, M., 1997. Beyond independence: conditions for the optimality of the simple Bayesian classifier. Machine Learn. 29, 103-130.
Ferri, C., Flach, P.A., Hernandez-Orallo, J., 2002. Learning decision trees using the area under the ROC curve. In: Proc. of the 19th Internat. Conf. on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 139-146.
Ferri, C., Flach, P.A., Hernandez-Orallo, J., 2003. Improving the AUC of probabilistic estimation trees. In: Proc. of the 14th European Conf. on Machine Learning. Springer, Berlin, pp. 121-132.
Hand, D.J., Till, R.J., 2001. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learn. 45, 171-186.
Kohavi, R., 1996. Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In: Proc. of the Second Internat. Conf. on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, pp. 202-207.
Ling, C.X., Yan, R.J., 2003. Decision tree with better ranking. In: Proc. of the 20th Internat. Conf. on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 480-487.
Merz, C., Murphy, P., Aha, D., 1997. UCI repository of machine learning databases. Dept. of ICS, University of California, Irvine.
Pagallo, G., Haussler, D., 1990. Boolean feature discovery in empirical learning. Machine Learn. 5 (1), 71-100.
Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., Brunk, C., 1994. Reducing misclassification costs. In: Proc. of the 11th Internat. Conf. on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 217-225.
Provost, F.J., Domingos, P., 2003. Tree induction for probability-based ranking. Machine Learn. 52 (3), 199-215.
Provost, F., Fawcett, T., 1997. Analysis and visualization of classifier performance: comparison under imprecise class and cost distribution. In: Proc. of the Third Internat. Conf. on Knowledge Discovery and Data Mining. AAAI Press, pp. 43-48.
Provost, F., Fawcett, T., Kohavi, R., 1998. The case against accuracy estimation for comparing induction algorithms. In: Proc. of the Fifteenth Internat. Conf. on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 445-453.
Smyth, P., Gray, A., Fayyad, U., 1996. Retrofitting decision tree classifiers using kernel density estimation. In: Proc. of the Twelfth Internat. Conf. on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 506-514.
Swets, J., 1988. Measuring the accuracy of diagnostic systems. Science 240, 1285-1293.
Witten, I.H., Frank, E., 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, Los Altos, CA.

Further reading

Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.