Knowledge-Based 5VSTEM~-"-ELSEVIER
Knowledge-Based Systems 10 (1998) 421-430
Constructing conjunctions using systematic search on decision trees Zijian Zheng School ~/' Computing and Mathematics, l}eakm University, Geelong, Victoria 3217, Australia
Received I I August 1997: accepted 3 November 1997
Abstract This paper investigates a dynamic path-based method for constructing conjunctions as new attributes for decision tree learning. It searches for conditions (attribute-value pairs) from paths to form new attributes. Compared with other hypothesis-driven new attribute construction methods, the new idea of this method is that it carries out a systematic search with pruning over each path of a tree to select conditions for generating a conjunction. Therefore, conditions for constructing new attributes are dynamically decided during the search. Empirically, evaluation in a set of artificial and real-world domains shows that the dynamic path-based method can improve the performance of selective decision tree learning in terms of both higher prediction accuracy and lower theory complexity. In addition, it shows some perfommnce advantages over a fixed path-based method and a fixed rule-based method for learning decision trees. © 1998 Elsevier Science BV. All rights reserved. Kevwords: Knowledge discovery; Machine learning; Constructive induction
1. Introduction Machine learning, especially inductive learning, plays an important role in knowledge discovery in databases (KDD). Constructive induction has been studied as a means to improve the quality of theories learned by inductive learning, Constructive induction algorithms construct new attributes from primitive attributes that are used to describe examples. The former are expected to be more appropriate than the latter for representing theories to be learned. People have explored lixed path-based and fixed rule-based approaches to constructing new binary attributes fl)r decision tree learning [1-3]. The number and positions of conditions for generating a new attribute in a path of a tree or a rule are predecided (fixed) in these algorithms. It has been shown that the number and locations of conditions in a path of a tree or a rule which are used to form a conjunction affect the pertk~rmance of this type of constructive induction algorithm [1,3]. This paper presents a novel method of constructing new attributes by pertorming a search over a path of a tree. It can find relevant conditions to form the best conjunction that can be created from the path in terms of an attribute evaluation function. It does not need to predecide the number and positions of conditions for generating new attributes. The conditions in a path used to construct a new attribute are dynamically selected. The number of conditions used to generate a new attribute is also 0950-7051/98/$19.00 <~ 1998 Elsevier Science B.V. All rights reserved. PII: S0950-7051(98'100036-7
dynamically decided. Therefore, this approach is referred to as the dynamic path-based approach. The CAT algorithm is an implementation of this approach. CAT employs a systematic search method with pruning [4]. It considers all possible combinations of conditions in a path and is efficient in practice as some parts of the search space are eliminated during search. This search method is due to W e b b ' s work on rule learning [41.
2. The CAT algorithm Like the FRINGE family of algorithms 12,5,6] and the CI algorithms [3 ], CAT is also a hypothesis-driven constructive induction algorithm for learning multivariate trees. It constructs new binary attributes by using the dynamic path-based method over previously learned decision trees. Its constructive operators are conjunction and negation (implicitly). 2.1. C o n t r o l s t r u c t u r e o f C A T
CAT creates new attributes from paths of a tree. It iterates a tree learning process (using C4.5) and a new attribute construction process. The whole process consists of two stages. In each iteration of the first stage, CAT learns a raw tree and a pruned tree based on relevant or important primitive
422
/. /hen,k,/Kmm'led~,e-Ba.sed Svstem~ I0 (1998) 421 430
attributes identified in the previous iteration. Note that all the primitive attributes are used in the first iteration. CAT constructs new attributes from the raw tree. It then builds two other pruned trees and identifies relevant or important primitive attributes. One pruned tree is based on both primitive and new attributes, while the other is based only on new attributes. Those primitive attributes that occur (directly or through new attributes) in the better one of these two trees are considered to be relevant or important, and are transferred to the next iteration. During this stage, new attributes contain only primitive attributes. After a stopping criterion is satisfied, CAT chooses the best tree among all the pruned trees that have been built. The relevant primitive attributes used in the iteration where the best tree occurs are used as the primitive attributes for the second stage. In each iteration of the second stage, C A T builds a raw tree and a pruned tree based on existing attributes, including both primitive attributes and previously created new attributes. New attributes are constructed from the raw tree, and are used for building decision trees in subsequent iterations. They may consist of both primitive and new attributes. Finally, C A T selects the best pruned tree from the two stages as its output. C A T uses an MDL-inspired heuristic flmction as its tree evaluation function [7]. The function is similar to the coding cost function used by Quinlan and Rivest [8], but exceptions of a tree on the training set are replaced by the more pessimistically estimated exceptions of the tree [9]. In addition, new attributes are encoded 17]. In the current implementation, CA]" adopts the following stopping criterion: • • •
No new attribute can be constructed, or No better pruned tree has been built in five consecutive iterations, or A given maximum iteration number is reached (the default value is 20).
Note that five consecutive iterations without producing better pruned trees are allowed to avoid a local optimum tree. The arbitrary number 'five' is just a default option setting of the algorithm. Each path of a decision tree labels one class. A new attribute, a conjunction of conditions, constructed from this path is expected to be good at discriminating this class flom other class(es). However, some paths cover no training examples or fewer examples of their labelled classes than examples of other classes, which might occur for multiple class problems. In this case, the evidence that a path predicts its labelled class is not strong. Therefore, C A T constructs one new attribute from each path that covers more examples of its labelled class than examples of other class(es).
as the positive class and all other classes are referred to as the negative class. All problems are temporarily transformed into binary class problems when constructing new attributes. Note that, when building and evaluating decision trees alter new attributes have been constructed, the learning problems are still tile original ones with original classes. Systematic search with pruning has the same outcome as exhaustive search, but it does not examine every state of the search space although it considers all the states. Some portions of the space can be eliminated as search progresses. To do this, all the states are organised into a tree structure such that every state can be visited once and only once by traversing the tree. In addition, the evaluation function should be able to tell whether or not a better state could be found by searching the subtrees of a node. For the purpose of searching for the best conjunction of conditions froln a path in terms of an evaluation function, the search space consists of all possible conjunctions that can be constructed from the path. Since the order of conditions of a conjunction does not matter, a search tree can be generated as follows. Its root is an empty conjunction that contains no conditions and is always true. Each possible direct specialisation of a node that is derived from the node by adding one condition forms a child. All the children of a node are created in some order such that, when generating their subtrees, each child does not use the conditions that are most recently added into those children that tire created belkne it. This makes sure that each conjunction occurs once and only once in the search tree. One simple way to order the children of a node is to use the sequence of the conditions, given that the attributes and their values are ordered when used to describe data. A condition that can be added to a coniunction must have a higher order than existing conditions of the conjunction. For example, given tour conditions and their order as: A, B, C and D, Fig. 1 shows a search tree. However, following the idea of OPus 141, C A T uses an information-based evaluation function (different from that of Opt:s) to order the children of a node when generating them. The objective is to increase the size of the regions that can be pruned. The new attribute evaluation function used by CAT lot carrying out systematic search is information gain. It /A^B^C
A^B^C^D
A B ~....A^B^D /A~A~C
~
2.2. Systematic search with pruning.!or constructing conjunctions
- -
At'CAD
~.~B^C B "'BAD C
B^C^D
CAD
D
Fig. I. A search tree fl)r constructing conjunctions from conditions A, B, ("
Here, the class labelled by the leaf of a path is referred to
and D.
Z. Zheng/Knowledge-Based S ystenls I 0 ( 199Y, J 4 2 1 - 4 3 0
satisfies the requirement for performing systematic search with pruning. In addition, its disadvantage of favouring attributes with more different values [9] does not show up here since all new attributes are binary. Supposing a conjunction covers pos out of Pos positive examples and neg out of Neg negative examples in a training set, it is not difficul! to prove that gain(Pos, Neg, pos, neg). defined as in Eq. ( 1). is monotonic with respect to either pos or neg when pos/Pos > neg/Neg. The function gain(Pos, Neg, l,OS, neg) decreases when pos decreases and increases when neg decreases under the condition p o s / P o s > n e g / N e g , which is a reasonable restriction because a conjunction, as a candidate new attribute, is expected to cover (relatively) more positive examples than negative examples. The examples covered by a conjunction are referred to as the cover of the conjunction. The positive examples and the negative examples in the cover are referred to respectively as the positive cover and the negative cot,er oi' the conjunction. Adding conditions to a conjunction usually reduces its cover, including positive cover, or negative cover, or both. The best thing that can be expected is that it reduces only the negative cover, thus increasing the gain value. For a given candidate new attribute CandNew. pos is its positive cover. ConjCond is the conjunction of all possible conditions that can be added to CandNew. Supposing neg' is the negative cover of (CandNew A CotzjCond), gain(Pos, Neg, pos, neg') gives the upper bound of the gain values of the new attributes that can be derived from CandNew. It is referred to as Potential-Gain of CandNew when describing the CAT algorithm. If it is not higher than the gain value of the current best new attribute, further searching in the subtree rooted at CandNew is not necessary. gain(Pos, Neg, pos, n e g ) = inh)(Pos, Neg) pro + neg × info(pos, Pos + Neg +
ile~, )
Pos - pos + Neg - neg
P o s + Neg ×
info(Po.~ - pos, Neg - n e g ) ) , /
infobo, n)
-
-
P
p+tz
p+ n
(l)
× log, ( p ~ n )
(2)
Table 1 details the method of systematic search with pruning used by CAT. The data structure for a candidate new attribute, Cand, contains three parts: Cand.Cs: its conditions; Cand.PCs: a set of potential conditions that can be added to Cand when extending it; and Cand.LastC: the condition that is most recently added to Cand. Given a path consisting of conditions and its labelled class as well as
423
Table I Systematic search with pruning for constructing a new attribute from a path Systematic-Search-Pruning(Path, D ,,~,,,,,0 INPUT: Path: all conditions in the path and its labelled class. D,~,i,,,,~,: training set. OUTPUT: a conjunctinn as a new attribute. Best.('~': = true Best.P('~: = {all conditions in P~,,IJ1} Candidates: = {Best} WHILE (Candi&ttes has at least one candidate whose P o t e n t i a l - G a i n ( ) > Gain(Best.Cs) i DO { Remo~e the candidate that maximises ( P o t e n t i a l - G a i n ( ) from ('andidates and assign it to ( ' a n d TempCandidates: = { } F O R each condition Cond in Cand.P('~ { New.('s: = ('atM.Cs A C o n d New.Last(': = ('(md IF I ( R e l P o s C o v e r t New.Us) > RelNegCo~er{Ne~t.C~)) A (Gain(New.Cs) > Gain(Be~t.Cs))) T H E N Best: = N e w IF ((PotentiaI-Gain(New.Cs, (_'and.PCs - Corot) > GainfBest.Cs)) A (not C a n n o t - l m p r o v e ( C a n d . C ~ , N e n ( ' ~ . Cand.PC~ Cond)) T H E N Add N e w to 7~mp(.~mdi<es E l S E Remove Cond from Cand.P('~
} F O R each candidate N e w in 7empCandi~Mte~ in ascending order wrt. P o t e n t i a I - G a i n t r { Remove N e w . L a s t C from Cand.P('s Nen'.PC~: = ('and.PC.~ IF ( P o t e n t i a l - G a i n ( Ne w. ( "~, Ne ~,. P~ "v) , { ;ai n( B~,st. Cv ) ) T H E N Add N e w to Camlidates
} I R E T U R N Best.('s'
the entire training set, the algorithm systematically searches through all possible conjunctions that can be generated from the path. To do so, it gradually extends an empty conjunction to create candidate new attributes by adding one possible condition of the path each time. Some parts of the search space are pruned during search. The algorithm starts from a candidate new attribute with no conditions (always true) in it. Its potential condition set contains all conditions of the path. It is used as the initial best new attribute, Best, and the only initial member of the set of candidate new attributes, Candidates. When there are still some candidates worth extending, the one most likely to lead to the best new attribute is first chosen to extend. This is done by inspecting Candidates to see whether it contains at least one candidate with a higher potential gain value than the gain value of the current best new attribute. If so, the candidate, ('and, with the highest potential gain value is removed from Candidates. The algorithm extends Cand to create specialised candidates by adding each of its potential conditions in turn. Each extended new attribute is examined to see whether it can be the current best new attribute and whether it is worth exploring further. If an extended new attribute, New, has a higher gain value than Best and covers relatively more positive examples than negative examples, it becomes the current best new attribute. If N e w ' s potential
424
z Zheng/Knowledge-Based Systems 10 (1998) 421-430
gain value is higher than the gain value of Best and the gain value of every new attribute that can be derived from Cand but without considering the condition New.LastC, it is added to a temporary set of candidate new attributes, TempCandidates. Function ' C a n n o t - I m p r o v e ( Cand.Cs, New.Cs, Cand.PCs - Cond)' returns true if 'PotentialGain ( Cand.Cs, Cand.PCs - Cond) ~> PotentialGain(New.Cs, Cand.PCs - Cond)', where New.Cs is derived by adding Cond to Cand.Cs. This means that any extension of New.Cs is worse than, or as good as, some extensions of Cand.Cs without considering Cond. After every potential condition in Cand.PCs is examined, the algorithm generates children of the node Cand in the search tree using candidate new attributes that are worth further extension. It examines each candidate, New, in TempCandidates in ascending order with respect to potential gain. The condition New.LastC is removed from Cand.PCs, and the resulting conditions form the set of potential conditions for New. If New has a higher potential gain value than the gain value of Best, it is added to Candidates for further extension later; otherwise, it is pruned from further exploration as it cannot lead to a better new attribute. Note that the potential conditions of any child of Cand do not contain conditions most recently added to the children of Cand that are generated earlier. Finally, when Candidates contains no candidate new attribute that is worth extending further, Best contains the best new attribute. It is worth mentioning that functions 'Gain()',
'Potential-Gain()', 'Cannot-Improve( )', 'RelPosCover()' and ' R e l N e g C o v e r 0 ' need D,ra~,i,~ and the class labelled by the path to compute their values, hut here they are omitted for the sake of simplicity. The worst case time complexity of the 'SystematicSearch-Pruning' algorithm is exponential on the size of a path (it is linear in the size of a training set), but it is quite efficient in practice when solving learning problems. The reason is that considerable portions of search spaces are eliminated. For example, in the Nettalk(Stress) domain, when constructing new attributes from the first tree, the search space contains 41 728.6 states on an average of the ten trials, whereas only 5033.8 states are really examined. More than 87.9% of the search space is pruned. It is even better in the Cleveland Heart Disease domain. The search space contains, on average, 93 563.3 states when constructing new attributes from the first tree, whereas only 732.3 states are really examined. More than 99.2% of the search space is pruned.
3. Experiments This section uses experiments to evaluate the CAT algorithm by comparing it with the C4.5 [9], SFRIN6E [7] and CI3 [7] algorithms in a set of artificial and real-world domains. Here, we focus on prediction accuracy and theory complexity. The theory complexity [10] is the modified tree
size. It is the sum of the sizes of all the nodes of the tree rather than the number of decision nodes or all the nodes of the tree. The size of a leaf is 1. The size of a decision node is I for a univariate tree, and is the number of attribute-value pairs, or conditions, in the test of the node for a multivariate tree. Computational requirements will be briefly addressed in the final subsection. SFRINGE is a member of the FmNGE lhmily of hypothesisdriven constructive decision tree learning algorithms [5]. It follows the idea of SYMFmNGE [6] with a straightforward extension. For each leaf, it constructs one new attribute using the conjunction of two conditions at the parent and grandparent nodes of the leaf. SFRINGE adopts the fixed path-based strategy. C|3 is also a hypothesis-driven constructive decision tree learning algorithm, but it adopts the fixed rule-based strategy. CI3 creates new attributes from production rules that are transformed from a decision tree. For each rule, it uses the conjunction of two conditions near the root of the tree as a new attribute (default option setting of the algorithm). Note that SFRINGE, C13 and CAT use the same decision tree building and pruning method, tree evaluation function, and stopping criterion. 3.1. Experimental domains and methods
Fourteen artificial logical domains are from Pagallo [5]. They cover a variety of well-studied artificial logical concepts in the machine learning community: randomly generated boolean concepts including DNF and CNF concepts, multiplexor concepts, parity concepts and majority concepts. We adopt the same experimental method as used by Pagallo [5], including the sizes of training and test sets. For each experiment, a training set and a test set are independently drawn from the uniform distribution. Experiments are repeated ten times in each of these domains. Besides the fourteen artificial logical domains, three Monks domains [12] and ten domains from the UCI repository of machine learning databases [11] are used. The ten UCI domains consist of five medical domains (Cleveland Heart Disease, Hepatitis, Liver Disorders, Pima Indians Diabetes, Wisconsin Breast Cancer), one molecular biology domain (Promoters), three linguistics domains (Nettalk(Phoneme), Nettalk(Stress), Nettalk(Letter)), and one game domain (Tic-Tac-Toe). In the three Nettalk domains, we use the 1000 most common English words, containing 5438 letters. These ten domains cover the spectrum of properties such as dataset size, attribute types and numbers, the number of different nominal attribute values, and the number of classes. For each UCI domain, a 10-fold cross-validation is conducted on the entire data set and all algorithms are run on the same partitions. For each Monks domain, one trial is carried out using the fixed training and test sets because they are provided by the problem designers [ 121. To compare the accuracies of two algorithms in a domain,
Z Zheng/Knowledge-Based Systems 10 (1998) 421-430
425
Table 2 Results of C4.5. StZRINGE, CI3 and CAT Domain
Accuracy I%) C4.5
DNFI DNF2 DNF3 DNF4 CNFI CNF2 CNF3 CNF4 MX6 MX 11 Parity4 Parity5 Maj I I Maj 13 Monks I Monks2 Monks3 Heart Disease Hepatitis Liver Disorders Diabetes Breast cancer Promoters Phoneme Stress Letter Tic-Tac-Toe
87.2 90.7 03.8 74.0 ~6.9 ~)0.6 ~)3.3 7 29 I G)0.0 't7.2 ~G.5 52.2 ~2.9 76.3 v5,7 ~5.0 97.2 73.3 v82 62. I ~ 1.5 ()4.8 76.3 8 I. 1 ;";2 7 737 ,~4.7
Complexity SFRINGE 97.1 99.6 99.5 100.0 96.5 99.3 99.7 99.4 100.0 100.0 I00.0 @ 87.3 92.7 @82.7 100.0 @64. I ¢~97.2 75.9 80,0 64.9 71.5 95.4 78.1 83.7 @85.8 @ 77.8 97.6
C13 100.0 99.9 99.9 100.0 100.0 99.5 99.7 100.0 100.0 1t30.0 100.0 74.7 92.8 86.5 100.0 @67.1 @95.8 75.5 82.7 63.4 73.7 96.0 82.9 82.4 @86.2
@66.8 98.4
CAT 99.4 99.2 99.4 99.2 99.4 99.2 99.3 99.1 100.0 100.0 99.8 65.6 92.5 85.7 100.0 75.9 100.0 75.9 82.0 67.0 73.0 95.0 86.9 83.5 88.0 74.6 98.3
a two-tailed block-based pairwise t-test is conducted. An instance-based pairwise sign-test is used in each Monks domain since only one trial is conducted. A difference is considered as significant if the significance level is above 95%. In tables summarising accuracies, boldface font indicates that an algorithm is significantly better than C4.5, while italic font indicates that an algorithm is significantly worse than C4.5.
3.2. Experimental results Table 2 shows the prediction accuracies and theory complexities of C4.5, SFRIN6E, CI3 and C A T in the artificial, the Monks and the UCI domains. We first compare C A T with C4.5. The table shows that C A T obtains significantly more accurate trees than C4.5 in all the artificial and Monks domains, except for the MX6 domain in which both C4.5 and C A T have a 100% accuracy. In twelve out of these seventeen domains, C A T builds much simpler trees than C4.5. It is worth mentioning that in the DNF4 domain, C A T learns a very complex tree in one of the ten trials because it creates some poor new attributes. This results in C A T having a large theory complexity in this domain. In the Parity5 domain, C A T also constructs some large new attributes, thus building trees with high complexity. In all the ten UCI domains, C A T has higher accuracies than C4.5. Five accuracy improvements are significant. In terms of
C4.5
SF;RINGE
CI3
263.0 202.0 101.8 525.0 271.2 192.8 94.4 532.8 48.6 168.6 238.4 1339.4 461.6 527.4 18.0 31.0 12.0 49.8 13.6 79.4 128.8 20.6 22.6 2339.2 2077.3 3394.9 128.5
125.3 45.5 37.9 48.2 140.6 36.0 39.9 71.6 15.8 55.1 38.4 445.2 725.1 692.6 I 1.0 13.0 14.0 45.6 10.8 75,4 128.8 14.9 14.0 1176.7 858.6 1821.5 69.3
60.7 70.6 43,4 43.7 55.3 62.3 50.8 44.0 15.4 76 1 47.t~ 351.3 10767 1496.7 12,(I 22.0 12.1) 20.5 140 443 25. ~ 14. I I1.8 1615 t) 812. t) 11725 3 I.~
CAT 69 I 475
45.2 2480.fi 61.2 34.~) 30.5 98.5 14.0 M.I) 42.9 2022.9 1538.9 916.5 9.t) 73.t) 6.0 8 I. I 12.2 152.7 64.5 28.'; II 5) 876. I 372.4 762.11 26.l)
theory complexity, trees generated by C A T are less complex than those built by C4.5 in seven domains and are more complex than those built by C4.5 ill the other three domains. To illustrate the effect of the constructive induction in C A T on generated decision trees, we compare decision trees created by C A T with those built by C4.5. We use the Tic-Tac-Toe domain as an example, since the target concept of Tic-Tac-Toe is known. Appendix A shows a pruned tree built by C4.5 on 862 randomly selected training examples (90% of the whole dataset). It has 127 nodes, and an error rate of 15.6% on the remaining 96 examples. Appendix B gives a decision tree created by C A T on the same training set as well as new attributes constructed by C A T and used in this tree. This tree has only three nodes. The first eight new attributes used in the last new attribute that appears in the tree represent all the eight cases of the Tic-Tac-Toe end game that are "win for o'. This tree represents the concept 'not win lbr o', which is very close to the target concept 'win for x'. Note that the cases that are 'not win for o' include both the cases that are "win for x' and the cases that are a draw. The tree built by C A T has an error rate of 2.1% on the same test set. In addition, it can be seen that the tree generated by C A T is much easier to understand than that built by C4.5. The former uses some meaningful subconcepts that are represented by the new attributes, and is smaller than the latter.
426
Z Zhen~/Knowledge-Bt sed Systems I0 (1998) 421 430
100
140 T i c - ' l h c - T o e
Tic-Tac-Toe
95
'""
•-'1 . . . . . .I "
120
"-" 90 85 80 u
75
40
7O 66
CAT
20 II
100 200 300 400 500 600 700 800 900
100 200 300 400 500 600 700 800 900
training set size
training set size Nettalk(Stress)
Nettalk(Stress) 85
~
80 75
...
~
}
I"
4000
:
~'-
v
70
65 60 55
50
45 41
C4.5 CI3
/
t;
]
.~
t
~
----i
'SFRiNC~
~ " ~ - -~ ~
20
40
60
CAT 80
100 120 140 160
training set size (x 10)
3500 ~ 3000 "~ 2500 t 2000 ~ 1500
C4.5
~i'3-
I
1000 500 11
_ . . . . .
20
40
60
80
. .:=.7.7.~7.7
.
100 120 140 160
training set size (x 10)
Fig. 2. Learning curves for CAT in the Tic-Tac-Toc and Nettalk(Stress) domains. These results illustrate that C A T can improve the performance of C4.5 in terms of higher prediction accuracy as well as lower theory complexity. Now let us consider the comparison of C A T with SFRINGE and CI3. In the artificial logical domains, as shown in Table 2, the accuracies of SFRINGE, CI3 and C A T are quite similar. C A T generates significantly more accurate trees than SFRINGE in one domain ( M a j l 3 ) which is indicated by ' G ' , while SFRINGE learns significantly more accurate trees than C A T also in one domain (Parity5), labelled by ' @ ' . No other accuracy difference between C A T and SFR1NGE or 0 3 in these domains is significant. The theory complexities of C A T are lower than those of SFRINGE in six out of the fourteen domains, and are higher than those of SFRINGE in the other eight domains. Compared with CI3, C A T achieves lower theory complexities in half of the fourteen domains, and obtains higher theory complexities in the other half of the domains. As shown in Table 2, C A T obtains the highest prediction accuracies in all the three Monks domains among these four learning algorithms. In the Monks2 and Monks3 domains, C A T is significantly more accurate than both SFRINGE and C13. This is indicated using ' O ' in the table. In the Monks l and Monks3 domains, C A T learns smaller trees than SFRINGE and CI3. Monks2 is an exactly M-of-N concept. Conjunctions are not appropriate new attributes for this domain, so the SFRINGE, CI3 and C A T algorithms cannot
learn a really good decision tree, whatever new attribute construction methods are used. It is similar in the parity and majority domains. The results in the Monks domains illustrate the performance advantage of C A T over SFRINOE and CI3 in terms of higher prediction accuracy. In the ten UCI domains, as far as prediction accuracy is concerned, C A T is better than SFRINGE in six domains, the same in one domain, and worse in the other three domains. Among the accuracy differences between CAT and SFRINGE, only in the Nettalk(Stress) domain is CAT significantly better than SFRINGE, and only in the Nettalk(Letter) domain is C A T significantly worse than SFRINGE. In terms of theory complexity, C A T is better than SFR1NC,E in six domains and worse in the other tour domains. Compared with CI3, C A T obtains higher accuracies in six domains and lower accuracies in the other four domains. Only in the Nettalk(Stress) and Nettalk(Letter) domains does C A T achieve significantly higher accuracies than C13. The theory complexities of C A T are lower than those of CI3 in five domains, and are higher than those of C13 in the other five domains. In summary, the performance of C A T is, on average, slightly better than that of SFRINGE and C13 in terms of prediction accuracy in the set of domains tested. Compared with SFRINGE, C A T is significantly more accurate in four domains and is significantly less accurate in two domains. C A T is significantly more accurate than CI3 in four
427
Z. Ztwng/Knowledge-Ilased System.s 10 (I 998i 421 -130
70 60
8 50 09
40
4.a
30
/
C4.5 CI3 'Si~R_iNaE
/ ~," /"
J
~..,: : /// //
20 10
Nettalk(Stress)
Tic-Tac-Toe
.- ..
......... ........
~ ~ ~ &~ .~ ~ ~ v
0
100 200 300 400 500 600 700 800 900 training set size
5000 4500 4000 3500 3000 2500 2000 1500 1000 500 0
C4.5
CI3 SFR,NGE i CAT
20
40
60 80 100 120 140 training set size ( x l 0 )
160
Fig. 3. Time complexiD, ol CAT. domains. In terms of theory complexity, C A T is less complex in fourteen domains and is more complex in thirteen domains than both SFRIN¢;~.: and C13. 3.,?. Learnin~ on is that the target conccpt itself is complex. When more training examples are available, more complex theories are generated. 3.4. ( ' O m l m t a t , m a l reqttirctmml.s
The most expensi,~e part of the C A T algorithm is the systematic search over paths, although pruning can reduce the search time. The depths of trees have a great effect on the execution time of CAT. Fig. 3 displays the execution time of C A T when the training set size increases in the
Tic-Tac-Toe and Nettalk(Stress) domains. (74.5. SFRIN(;E and C13 are used as references. In the Tic-Tac-Toe domain, the execution time of C A T grows slowly since the tree sizes do not increase after the training set size reaches 500. It can also be seen that C A T uses less time than CI3 for all the training set sizes in this domain. However, m the Nettalk(Stress) domain, the execution time of ( ' A T increases quickly. One possible reason is the increased depths of trees u>ed bx C A T for generating new attributes. Note that the theory complexities reported in the previous subsection are different from the depths of trees. The number of decision nodes of a tree, especially the depth of paths, rather than the theory complexity o f the tree affects the execution time of CAT.
4. D i s c u s s i o n
The new attribute search space lk~r C A T is a superset of that for C13 and SFRIN~m. Howe~er, our experimental results show that C A T cannot achieve the higher of the accuracies obtained by C13 and SFRIN(;V. in quite a few domains tested, although the accuracy differences are not large. One reason might be that inlormation gain is used as the heuristic function to evaluate new attributes. Good new' attributes cannot always be selected e~en if they arc examined. On the ()the. hand, this can be explained as the oversearching phenomenon [ 131.
5. Related w o r k
As fat as constructing new attributes for decision tree learning is concerned, related work includes the FRINGI~ family of algorithms such as FRINGE, DUAI. FRINGE, SYMMEIRI(" FRINGE 12,51, SYMFRIN(iE. DCFRIN{iE [6] and SFRINGE 171, the C n r e algorithm 11,141. the CI algorithms [3,71. the LFC algorithm [151, the ID2-of-3 algorithm [161, the LMDI algorithm [I 7] and the Xol.'N algorithm [10]. They use different constructive operators and different strategies to create new attributes.
428
Z. Zheng/Knowledge-Based System~' I0 (1998) 421-430
As far as systematic search is concerned, the closest related work is Oeus [4]. Some ideas in CAT about the systematic search with pruning are taken from it. OPus carries out systematic search with pruning over the space of all possible disjuncts at the inner level of a covering rule learning algorithm. The method of systematic search with pruning used in CAT is very similar to that in Oeus. The main difference is that CAT uses information gain as the evaluation function, while Oeus employs the Laplace function [4]. In addition, CAT searches for a conjunction as a new attribute for learning a decision tree, whereas Oeus searches for a conjunction as a rule for learning a set of unordered rules. Rymon [18,19] uses systematic search with pruning to learn SE-trees, a type of tree structure containing rules that predict classes using attributes. Schlimmer [20] adopts systematic search with pruning for inducing determinations that identify which factors influence others. Webb [21,22] further explores the systematic search in a more general way. In addition, Webb [22] proposes a few new pruning rules for systematic search.
6. Conclusions and future work This paper has investigated a dynamic path-based approach to the construction of new attributes for decision tree learning. For each path of a tree, a conjunction is generated by searching over the path. A systematic search method with pruning has been explored. The advantage of this approach is that the number and locations of conditions lbr creating a conjunction are dynamically decided. Irrelevant conditions should be filtered out from new attributes. In the current implementation of CAT, systematic search with exclusive pruning is used. Inclusive pruning [22] may further reduce execution time. The principal idea of the dynamic path-based method is to generate conjunctions by carrying out search over every path of a tree instead of employing the fixed strategy to create new attributes by using conditions from predefined positions of each path. Alternative search methods such as simple and less expensive greedy search are worth exploring, especially considering the oversearching phenomenon [13]. The experiments found that the dynamic path-based method can significantly improve the performance of decision tree learning in most of the domains studied in terms of both higher prediction accuracy and lower theory complexity. In no case does CAT reduce the accuracy of C4.5. Comparison with SFRINGE and CI3 reveals that the dynamic path-based method performs, on average, slightly better than the fixed path-based method and the fixed rulebased method in the artificial and real-world domains tested.
Acknowledgements This research was partially supported by an ARC grant (to
Ross Quinlan) and by a research agreement with Digital Equipment Corporation at the University of Sydney. The author appreciates the advice and suggestions Ross Quinlan has given. Ross Quinlan predicted that constructing conjunctions by searching over paths of a tree should be better than, or at least as good as, creating conjunctions from production rules for decision tree learning. This initiated the idea of this paper. Mike Cameron-Jones, Kai Ming Ting and Alen Var~ek gave many helpful suggestions. Many thanks to Ross Quinlan for providing C4.5, and to Geoff Webb for his useful comments and for helping me to obtain a better understanding of some details of systematic search with pruning. P.M. Murphy and D. Aha are gratefully acknowledged for creating and managing the UCI Repository of machine learning databases.
Appendix A A pruned tree created by C4.5 in the Tic-Tac-Toe domain middle-middle-square = x: bottom-right-square = b: p bottom-right-square = x: top-left-square = x: p top-left-square = b: p top-left-square = o: top-right-square = x: bottom-left-square = x: p bottom-left-square = b: p bottom-left-square = o: middle-right-square = x: p middle-right-square = o: n middle-right-square = b: n top-right-square :- o: top-middle-square = o: n top-middle-square = b: p top-middle-square = x: bottom-middle-square = x: p bottom-middle-square = o: n bottom-middle-square = b: n top-right-square = b: middle-left-square = x: p middle-left-square = b: p middle-left-square = o: bottom-left-square = x: p bottom-left-square = o: n bottom-left-square = b: p bottom-right-square = o: top-right-square = x: bottom-left-square = x: p bottom-left-square = b: p bottom-left-square = o: bottom-middle-square = o: n bottom-middle-square = b: p bottom-middle-square = x: top-middle-square = x: p
Z. Zheng/Knowledge-Based Svstem,~ I0 (1998) 421 4.¢0 t o p - m i d d l e - s q u a r e = o: n
b o t t o m - m i d d l e - s q u a r e = x: p
t o p - m i d d l e - s q u a r e = b: n
b o t t o m - m i d d l e - s q u a r e = o: n
t o p - r i g h t - s q u a r e = o:
b o t t o m - m i d d l e - s q u a r e = b: p
m i d d l e - r i g h t - s q u a r e = x: p
m i d d l e - m i d d l e - s q u a r e = b:
m i d d l e - r i g h t - s q u a r e = o: n
b o t t o m - l e f t - s q u a r e = x:
m i d d l e - r i g h t - s q u a r e = b: p t o p - r i g h t - s q u a r e = b: b o t t o m - m i d d l e - s q u a r e -- x: p
t o p - r i g h t - s q u a r e = x: p t o p - r i g h t - s q u a r e = b: p t o p - r i g h t - s q u a r e = o:
b o t t o m - m i d d l e - s q u a r e = b: p
b o t t o m - r i g h t - s q u a r e = x: p
b o t t o m - m i d d l e - s q u a r e = o:
b o t t o m - r i g h t - s q u a r e = o:
b o t t o m - l e f t - s q u a r e = x: p
m i d d l e - r i g h t - s q u a r e = x: p
b o t t o m - l e f t - s q u a r e = o: n
m i d d l e - r i g h t - s q u a r e = ~: n
b o t t o m - l e f t - s q u a r e = b: p m i d d l e - m i d d l e - s q u a r e = o:
m i d d l e - r i g h t - s q u a r e = b: p b o t t o m - r i g h t - s q u a r e = b:
t o p - r i g h t - s q u a r e = x:
t o p - l e f t - s q u a r e = x: p
t o p - l e f t - s q u a r e = x:
t o p - l e f t - s q u a r e = o: n
t o p - m i d d l e - s q u a r e = x: p t o p - m i d d l e - s q u a r e = o:
t o p - l e f t - s q u a r e = h: p b o t t o m - l e f t - s q u a r e = o:
b o t t o m - m i d d l e - s q u a r e = x: n
t o p - r i g h t - s q u a r e = o: n
b o t t o m - m i d d l e - s q u a r e --- o: n
t o p - r i g h t - s q u a r e = b: n
b o t t o m - m i d d l e - s q u a r e = b: p
t o p - r i g h t - s q u a r e = x:
t o p - m i d d l e - s q u a r e = b: b o t t o m - m i d d l e - s q u a r e = x: n
b o t t o m - r i g h t - s q u a r e = x: p b o t t o m - r i g h t - s q u a r e = o:
b o t t o m - m i d d l e - s q u a r e = o: p
b o t t o r n - m i d d l e - s q u a r e = x: p
b o t t o m - m i d d l e - s q u a r e = b: p
b o t t o m - m i d d l e - s q u a r e = o: n
t o p - l e f t - s q u a r e = o:
b o t t o m - m i d d l e - s q u a r e = b: p
b o t t o m - r i g h t - s q u a r e = o: n
b o t t o m - r i g h t - s q u a r e = b:
b o t t o m - r i g h t - s q u a r e = b: n
t o p - l e f t - s q u a r e = x: p
b o t t o m - r i g h t - s q u a r e = x:
t o p - l e f t - s q u a r e = o: n
m i d d l e - r i g h t - s q u a r e = x: p m i d d l e - r i g h t - s q u a r e = o: n m i d d l e - r i g h t - s q u a r e = b: p t o p - l e f t - s q u a r e = b: m i d d l e - r i g h t - s q u a r e = x: p
429
t o p - l e f t - s q u a r e = h: p b o t t o m - l e f t - s q u a r e = b: t o p - r i g h t - s q u a r e = x: p t o p - r i g h t - s q u a r e = o: n t o p - r i g h t - s q u a r e = b: p
m i d d l e - r i g h t - s q u a r e = o: n m i d d l e - r i g h t - s q u a r e = b: n t o p - r i g h t - s q u a r e = o: b o t t o m - l e f t - s q u a r e = o: n
Appendix B A decision tree and new attributes created by CAT in the Tic-Tac-Toe domain
b o t t o m - l e f t - s q u a r e = b: n b o t t o m - l e f t - s q u a r e = x:
D e c i s i o n tree:
t o p - m i d d l e - s q u a r e = b: p
N e w A 1 4 4 = t: p
t o p - m i d d l e - s q u a r e = x:
NewA144=
b o t t o m - r i g h t - s q u a r e = o: n b o t t o m - r i g h t - s q u a r e = b: p b o t t o m - r i g h t - s q u a r e = x: b o t t o m - m i d d l e - s q u a r e = x: p b o t t o m - m i d d l e - s q u a r e = o: n b o t t o m - m i d d l e - s q u a r e = b: n t o p - m i d d l e - s q u a r e = o: b o t t o m - m i d d l e - s q u a r e = x: p b o t t o m - m i d d l e - s q u a r e = o: n b o t t o m - m i d d l e - s q u a r e = b: p t o p - r i g h t - s q u a r e = b:
f:n
N e w attributes:
NewA50: t o p - l e f t - s q u a r e = o A t o p - m i d d l e - s q u a r e = o A top-right-square = o NewA54: top-left-square = o A middle-left-square = o A bottom-left-square = o NewA62: bottom-left-square = o A bottom-middlesquare = o A bottom-right-square = o NewA66: top-right-square = o A middle-right-square = o A bottom-right-square = o
NewA8S: t o p - m i d d l e - s q u a r e = o A m i d d l e - m i d d l e -
t o p - m i d d l e - s q u a r e = x: n
square = o A bottom-middle-square = o
t o p - m i d d l e - s q u a r e = b: p
NewA92: middle-left-square = o A middle-middle-
t o p - m i d d l e - s q u a r e = o:
s q u a r e = o A m i d d l e - r i g h t - s q u a r e := o
Z. Zheng/Knowled.~e-Based Systems 10fl998) 421 430
430 NewA95:
top-left-square
= o A middle-middle-square
= o A bottom-right-square NewAll0:
= o
top-right-square = o A middle-middlesquare = o A bottom-left-square
NewA144:NewA95
[12]
= o
= f A N e w A 1 10 = f
A NewA92
= f A NewA66
= f
A NewA50
= f A NewA62
= f
A NewA54
= f A NewA85
= f
Note that 'f' stands for false.
References
[13]
[14]
[l] C.J. Matheus, Feature construction: an analytic framework and an
[2] [3]
[4]
[5]
[6]
[7]
[8]
[9] [I 0]
[I 1]
application to decision trees, PfiD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL. 1989. G. Pagallo, D. Haussler, Boolean feature discovery in empirical learning, Machine Learning 5 (1990) 71 - 100. Z. Zheng, Constructing conjunctive tests for decision trees, in: Proceedings of the Fifth Australian Joint Conference on Artificial Intelligence, World Scientific, Singapore, 1992, pp. 355-360. G.I. Webb, Systematic search for categorical attribute-value data-driven machine learning, in: Proceedings of the Sixth Australian Joint Conference on Artificial Intelligence. World Scientific, Singapore, 1993, pp. 342-347. G. Pagallo, Adaptive decision tree algorithms for learning from examples, PfiD thesis, University of California at Santa Cruz, Santa Cruz, CA, 1990. D. Yang, L. Rendell, G. Blix, A scheme for feature construction and a comparison of empirical methods, in: Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Mateo, CA, 1991, pp. 699-704. Z. Zheng, Constructing new attributes for decision tree learning, PhD thesis, Basset Department of Computer Science, University of Sydney, Australia, 1996. J.R. Quinlan, R.L. Rivest, Inferring decision trees using the minimum description length principle, Information and Computation 80 (1989) 227-248. J.R. Quinlan, C4.5: Programs tk)r Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993. Z. Zheng, Constructing nominal X-of-N attributes, in: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Mateo. CA, 1995, pp. 1064-1070. P.M. Murphy, D.W. Aha, UCI repository of machine learning databases, Department of Information and Computer Science,
[I 5]
[16]
[17]
[18]
[19]
[20]
[211
[22]
University of California, lrvine, CA, http://www.ics.uci.cdu/ mlearn/MLRepository.html, I996. S.B. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cesmik, J. Cfieng, K. De Jong, S. D~eroski, S.E. Fahhnan, D. Fisher, R. l-tamann, K. Kaufman. S. Keller. I. Kononenko, J. Kreuziger. R.S. Micfialski, T. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de Welde, W. Wenzel, J. Wnek, J. Zhang, The MONK's problems a performance comparison of different learning algorithms, Technical Report CMU-CD--91-197, Department of Computer Scien~:c, Carnegie Mellon Llniversity, Pittsburgh, PA, 1991. J.R. Quinlan, R.M. Cameron-Jones, Oversearcfiing and layered search in empirical learning, in: Proceedings of the Fourteenth International Joint Coutk:rence on Artificial Intelligence, Morgan Kaul:munn, San Maleo, CA, 1995. pp. 1019-1024. C.J. Matheus, L.A. Rendell, Consmlctive inductiun on decision trees, in: Proceedings of the Eleventh International Joint Conl;erence ,m Artificial Intelligent:c, Morgan Kaufmann, San Mateo, CA, I989, pp. 645 650. H. Ragavun, L. Rendell, Lookafiead feature construcfion lor learning hard concepts, in: Proceedings of the Tenth International Conference on Machine l~earning, Morgan Kaufmann. San Mutco, CA. 1993, pp. 252 259. P.M. Murphy, M.J. Pazzani, ID2-of-3: constructive induction of M-of-N concepts for discriminators in decision trees, in: Proceedings of the Eighth International Workshop on Machine Learning, Morgan Kaufmann, San Mateo, CA, 1991, pp. 183-187. C.E. Brodley, P.E. Utgoff, Multivariate versus univariate decision trees, COINS Technical Report 92-8, Department of (7onlptltcr Science, University of Massachusetts, Amherst, MA, 1992. R. Rymon. Search through systematic set enumeration, in: Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning. MIT Press, Cambridge, MA, 1992. pp. 539-550. R. Rymon, An SE-tree based characterisation of the induction problem, in: Proceedings of the Tenth International Conl~:rence on Machine Learning, Morgan Kaufmann. San Marco, ('A, 1993, pp. 268-275. J.C. Scfilimmer, Efficiently reducing determinations: a complete and systematic search algorithm thai uses optimal pruning, in: Proceedings of the Tenth International Conference on Machine Learning. Morgan Kaufmaml, San Marco, CA. 1993. pp. 284 290. G.I. Webb. OPUS: an efficient admissible algorithm for unordered search. Journal of Artificial Intelligence Research 3 (1995)431465. G.I. Wcbb. Inclusive pruning: a new class of pruning role lor unordered search and its application to classification learning, m: Proceedings of the Nineteenth Australian Computer Science Conference, Australian Computer Science Communications. Vol. 18, 1996. pp. I 10.