Expert Systems with Applications 25 (2003) 199–209 www.elsevier.com/locate/eswa
Constructing a multi-valued and multi-labeled decision tree
Yen-Liang Chen*, Chang-Ling Hsu, Shih-Chieh Chou
Department of Information Management, National Central University, Chung-Li 320, Taiwan, ROC
* Corresponding author. Tel.: +886-3-4267266; fax: +886-3-425604. E-mail address: [email protected] (Y.-L. Chen).
Abstract

Most decision tree classifiers are designed to classify objects whose attributes and class labels are single values. However, many practical classification problems involve multi-valued and multi-labeled data. For example, a customer record in a tour company may have multi-valued attributes, such as the cars, hobbies and houses of the customer, and multiple labels corresponding to the tours joined before. If the company intends to use customer data to build a classifier that predicts what kinds of customers are likely to participate in what kinds of tours, a requirement arises immediately: how to design a new classification algorithm that classifies multi-valued and multi-labeled data. This research therefore develops such a classifier. We found that the design of several major functions used in our classifier differs from that of existing ones, including how to select the next splitting attribute, when to stop the splitting of a node, how to determine a node's labels, and how to predict the labels of a new data record. In this paper, all these issues are addressed and solved. The simulation results show that the proposed algorithm performs well both in computing time and in accuracy. © 2003 Elsevier Science Ltd. All rights reserved.

Keywords: Decision tree; Data mining; Classification; Multi-valued attribute; Multi-labeled attribute; Prediction; Customer relationship management
1. Introduction

Classification is a learning method frequently adopted in the fields of data mining, statistics, machine learning, genetic algorithms and neural networks (Han & Kamber, 2001). The classification problem is a two-step process: the first step builds a classification model by analyzing a training sample set described by attributes, and the second step uses this model to classify future samples for which the class label is not known. For example, we can use the classification model learned from existing customers' data to predict what services a new customer would like. In the past, a number of different approaches have been proposed to build classifiers from a training sample set. As indicated in Han and Kamber (2001), the well-known ones include decision tree classifiers, Bayesian classifiers, neural network classifiers, k-nearest neighbor classifiers, case-based reasoning methods, genetic algorithms, the rough set approach and the fuzzy set approach. Among these approaches, decision tree classifiers are probably the most popular and widespread, for they are (1) able to generate understandable rules, (2) able to perform classification without requiring much computation, (3) able to
handle both continuous and categorical variables, and (4) able to provide a clear indication of which attributes are most important for prediction or classification. Because of its importance, the problem of building decision tree classifiers has been extensively studied in the past, and many algorithms have been reported (Agrawal et al., 1992, 1993; Bramer, 2002; Mehta et al., 1996; Moshkovich et al., 2002; Quinlan, 1986, 1993; Rastogi, 2000; Ruggieri, 2002; Shafer et al., 1996; Steinberg & Colla, 1995; Umano et al., 1994; Wang et al., 1998; Wang & Zaniolo, 2000). However, these algorithms share a common shortcoming: they assume that the input data is relational. That is, the input data is a table containing a set of classifying attributes, where each attribute maps to a single value of Boolean, categorical or numerical type. In addition, each training sample is assumed to belong to a predefined class that is determined by one of the attributes, called the class-label attribute. In the following, we discuss two real-world situations that reveal the weakness of the traditional approach.

1. Multi-valued data. An attribute of a sample may have a set of values rather than an atomic value, and the number of values of an attribute is variable. For example, the customer data in a tour company may include the hobbies of the customers and the languages used by the customers.
0957-4174/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved. doi:10.1016/S0957-4174(03)00047-2
Since a customer may have a number of hobbies and use several languages, these two attributes, 'hobby' and 'language', are multi-valued attributes.

2. Multi-labeled data. A sample may carry a single class label or multiple class labels; that is, a sample may belong to several classes simultaneously. There are many situations in the real world in which an object can only be described by multi-labeled data. For example, a tour company might be interested in predicting which customers are interested in which tour packages. In this case, the customers' profiles and their past travel histories could be used to build a model that describes the associations between the customers' characteristics and their preferred tours. If we use the tour as the label of a customer, a customer record will have several labels associated with it, since a customer may have joined several tours in the past. As another example, a mutual fund company might be interested in building a model that associates the characteristics of customers with the funds they would possibly buy. Since a mutual fund company usually offers many funds to meet different investment needs and a customer may own several funds, a customer record may be attached with a set of labels.

Here, some readers may wonder why we do not use the standard database normalization method (Codd, 1972) to transform the data into the third normal form, i.e. single-valued and single-labeled, and then apply the traditional classification methods to build the decision tree. If we did this, the obtained classifier could only predict a single label for an input sample whose attributes are all single values. This violates the requirement that a decision tree classifier designed for multi-valued and multi-labeled data should be able to predict multiple labels, even when all the attributes of a sample happen to be single values. For
example, if we have a tourist record indicating gender female and income $750, the traditional classifier can only predict a single label, say that she may join tour C1, whereas the new classifier can predict several labels, say that she is likely to join tours C1, C2 and C3. In view of this weakness, the aim of this paper is to design a novel classification algorithm for multi-valued and multi-labeled data. Throughout this paper, we use an example to illustrate the problem situation and the solution. The example assumes that a tour company uses customer profiles and the tours they have joined before to build a model that predicts in what tours a new customer will be interested. We formally define the problem in Section 2. Then, we propose the tree construction algorithm and the prediction algorithm in Section 3. The performance evaluation is presented in Section 4 and the conclusion is given in Section 5.
2. Problem definition

Before giving a formal definition, we use a simple example of 15 customers in Table 1 as a training set to describe the problem, the requirements and the expected result. From this data set, we want to build a decision tree. In the table, the attributes marital status, gender and hobby are categorical, the attribute income is numerical, and the attribute hobby is multi-valued. Each object in the data can have at most three different labels, C1, C2 and C3, associated with it. For example, customer P3 has {arts, shopping} as the value of the attribute hobby, and customer P5 is attached with the two labels C2 and C3.
Table 1
Training set for 15 customers

Customer id | Marital status | Income ($) | Gender | Hobby                  | Class label
P1          | M              | 100        | Female | Arts                   | C1, C2, C3
P2          | S              | 880        | Male   | Arts                   | C2, C3
P3          | M              | 370        | Female | Arts, shopping         | C1
P4          | D              | 1230       | Male   | Sports                 | C2
P5          | S              | 910        | Male   | Arts, sports           | C2, C3
P6          | S              | 770        | Female | Arts                   | C1, C2, C3
P7          | S              | 590        | Female | Arts, shopping         | C1, C2
P8          | D              | 1350       | Male   | Shopping               | C1, C2, C3
P9          | D              | 1250       | Male   | Arts, shopping         | C1, C2, C3
P10         | S              | 1140       | Male   | Arts, shopping         | C1
P11         | M              | 340        | Female | Arts, sports           | C1, C3
P12         | D              | 1300       | Male   | Arts                   | C1, C2
P13         | S              | 1090       | Male   | Sports                 | C3
P14         | S              | 810        | Male   | Shopping               | C1
P15         | S              | 520        | Female | Arts, sports, shopping | C3
Fig. 1. A multi-valued and multi-labeled decision tree.
Suppose the decision tree built from the training set is the one shown in Fig. 1. In the tree, each internal node corresponds to a decision on an attribute and each branch corresponds to a possible value or an interval of that attribute. The leaves give the final multi-labeled results. With this tree, we can predict what tours a new customer will be interested in. For example, if there is a female customer with income $750, the tree indicates that she is likely to join tours C1, C2 and C3. As a second example, consider another customer whose income is $500 and whose hobbies include sports and arts. Tracing down the tree from the root, we reach two leaves, where the first leaf node has the label C1 and the second has the label C3. In this case, we report that this customer may be interested in tours C1 and C3. Finally, we can transform the decision tree into an equivalent set of rules. Fig. 2 shows some rules generated from the tree. Each rule is generated by traversing the decision tree, i.e. starting from the root node and finding all paths to the leaf nodes. Each path results in an 'if-condition-then-label' rule.

Now, we formally define the problem. We assume that the training data set is stored in file D, and let |D| denote the number of records in D. Let C = {Ci | i = 1, ..., q} be the set of all class labels, where Ci is a class label. Besides, we use a variable Lj, called a label-set, to represent a set of labels in C, where Lj ⊆ C. Therefore, each record in D can be represented as (A, Lj), where A is a set of attributes Ai. Note that each attribute Ai can be mapped to a single value or multiple values, and can be categorical or numerical. Our goal is to build a decision tree classifier that can predict the value of Lj when the values of A are given.

The constructed tree is a multi-degreed tree T(V, E), where V is a set of nodes and E is a set of branches. To deal with numerical attributes, we use a user-specified parameter ub to set an upper bound on the number of branches that an internal node of a numerical attribute can fan out. That is, each internal node vi ∈ V that corresponds to a decision made on a numerical attribute must satisfy 2 ≤ Degree(vi) ≤ ub, where Degree(vi) is the number of outgoing arcs of node vi. The parameter ub helps us avoid generating too many rules. In addition, when splitting branches from a node, we adopt the definition of an interval used in IC
(Agrawal et al., 1992): each branch corresponds to a range of values for a numeric attribute and corresponds to a single value for a categorical attribute.
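To make the data model concrete, the following minimal sketch (ours, not part of the paper's formulation; the field names are illustrative only) shows how a training record with multi-valued attributes and a label-set, such as customer P3 of Table 1, might be represented.

    # One training record (A, Lj): attributes may be single- or multi-valued,
    # and the class label is a set of labels drawn from C = {C1, C2, C3}.
    record_P3 = {
        "attributes": {
            "marital_status": "M",          # single-valued categorical
            "income": 370,                  # single-valued numerical
            "gender": "Female",
            "hobby": {"arts", "shopping"},  # multi-valued categorical
        },
        "label_set": {"C1"},                # label-set Lj, a subset of C
    }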
3. The algorithm

In a multi-valued and multi-labeled decision tree, the label-set selected to represent a leaf node is based on the similarity between two label-sets and among a set of label-sets. Thus, the development of the algorithm depends on how the similarity between two label-sets and among a set of label-sets is defined. We first define these notions and discuss how to measure them in Section 3.1. Then, we develop an algorithm for constructing the decision tree in Section 3.2. Finally, we discuss how to use the constructed tree to predict the label-set to which a new object belongs, and how to measure the accuracy of a prediction, in Section 3.3.

3.1. Measuring the similarity

We first define the symbols used to measure the similarity between two label-sets Li and Lj as follows.

same(Li, Lj): the number of labels that appear in both Li and Lj.
different(Li, Lj): the number of labels that appear in either Li or Lj but not both.
cardinality(Li, Lj): the number of different labels that appear in Li or Lj.
Fig. 2. The rules obtained from the decision tree in Fig. 1.
Using the above symbols, we define the similarity between Li and Lj as

\text{similarity}(L_i, L_j) = \left( \frac{\text{same}(L_i, L_j)}{\text{cardinality}(L_i, L_j)} - \frac{\text{different}(L_i, L_j)}{\text{cardinality}(L_i, L_j)} + 1 \right) \Big/ 2

For example, if Li = {C1, C2} and Lj = {C1, C3}, then we have same(Li, Lj) = 1, different(Li, Lj) = 2, cardinality(Li, Lj) = 3 and similarity(Li, Lj) = 1/3. As another example, if Li = {C1, C2} and Lj = {C3, C4}, then we have same(Li, Lj) = 0, different(Li, Lj) = 4, cardinality(Li, Lj) = 4 and similarity(Li, Lj) = 0. In Table 2, we show the similarity measures for all the label-sets obtained from C1, C2 and C3; we call this table the similarity table. If we compute the similarity table for all possible label-sets before running the tree construction algorithm in Section 3.2, the similarity between any two label-sets can then be obtained in constant time by looking it up in the table. Therefore, we assume that the similarity table has been pre-computed before executing the tree construction algorithm.

Although this definition of similarity comes from heuristics, it has many good properties. We list these properties below, but omit the proofs for brevity.

Property 1. similarity(Li, Lj) = similarity(Lj, Li).
Property 2. 0 ≤ similarity(Li, Lj) ≤ 1.
Property 3. If all the labels in Li and Lj are different, then similarity(Li, Lj) = 0.
Property 4. If all the labels in Li and Lj are the same, then similarity(Li, Lj) = 1.
Property 5. For three different label-sets Li, Lj and Lk, if same(Li, Lk) = same(Lj, Lk) but different(Li, Lk) ≤ different(Lj, Lk), then similarity(Li, Lk) ≥ similarity(Lj, Lk).
Property 6. For three different label-sets Li, Lj and Lk, if same(Li, Lk) ≤ same(Lj, Lk) but different(Li, Lk) = different(Lj, Lk), then similarity(Li, Lk) ≤ similarity(Lj, Lk).
Property 7. For three different label-sets Li, Lj and Lk, if same(Li, Lk) < same(Lj, Lk) but different(Li, Lk) > different(Lj, Lk), then similarity(Li, Lk) < similarity(Lj, Lk).

Based on the similarity measure between two label-sets, we can define the similarity among a set of label-sets L = {L1, L2, ..., Lm} as follows.

\text{set-similarity}(L) = \frac{\sum_{i<j} \text{similarity}(L_i, L_j)}{m(m-1)/2}    (1)

Although this definition seems reasonable, it may cause a performance problem: when m is large, the computation is time-consuming, and m can indeed be large because the amount of data to be classified may be huge. To remedy this difficulty, we rewrite the definition. Note that although the number of data records is usually large, the number of different labels is usually small. That is, many label-sets in L may be identical, and hence the original L can be rewritten in the form {(NL1, count1), (NL2, count2), ..., (NLr, countr)}, where NLi denotes the ith distinct label-set and counti is the number of label-sets in L that equal NLi. In other words, we have the relation \sum_{i=1}^{r} count_i = m. Therefore, the original definition of set-similarity(L) can be rewritten as

\text{set-similarity}(L) = \frac{\sum_{i=1}^{r} C^{count_i}_{2}\,\text{similarity}(NL_i, NL_i) + \sum_{i<j} C^{count_i}_{1} C^{count_j}_{1}\,\text{similarity}(NL_i, NL_j)}{m(m-1)/2}
Table 2
The similarity table for all the label-sets obtained from C1, C2 and C3

           | C1  | C2  | C3  | C1,C2 | C1,C3 | C2,C3 | C1,C2,C3
C1         | 1   | 0   | 0   | 1/2   | 1/2   | 0     | 1/3
C2         | 0   | 1   | 0   | 1/2   | 0     | 1/2   | 1/3
C3         | 0   | 0   | 1   | 0     | 1/2   | 1/2   | 1/3
C1,C2      | 1/2 | 1/2 | 0   | 1     | 1/3   | 1/3   | 2/3
C1,C3      | 1/2 | 0   | 1/2 | 1/3   | 1     | 1/3   | 2/3
C2,C3      | 0   | 1/2 | 1/2 | 1/3   | 1/3   | 1     | 2/3
C1,C2,C3   | 1/3 | 1/3 | 1/3 | 2/3   | 2/3   | 2/3   | 1
Because similarity(NLi, NLi) = 1, C^{count_i}_1 = counti and C^{count_j}_1 = countj, it can be further simplified as

\text{set-similarity}(L) = \frac{\sum_{i=1}^{r} C^{count_i}_{2} + \sum_{i<j} count_i \times count_j \times \text{similarity}(NL_i, NL_j)}{m(m-1)/2}    (2)

Comparing formula (1) with formula (2), we see that the computing time is improved from O(m^2) to O(r^2). When m ≫ r, this saves much time.
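As a concrete illustration (our own sketch, not code from the paper), the pairwise similarity measure and the count-based form (2) of set-similarity can be computed as follows; Example 1 below can serve as a check, for which this sketch returns approximately 0.634.

    # Sketch of similarity(Li, Lj) and of set-similarity via formula (2),
    # i.e. over distinct label-sets and their counts rather than all m pairs.
    from collections import Counter
    from itertools import combinations

    def similarity(li, lj):
        li, lj = frozenset(li), frozenset(lj)
        same = len(li & lj)                 # labels in both
        different = len(li ^ lj)            # labels in exactly one
        cardinality = len(li | lj)          # distinct labels in either
        return (same / cardinality - different / cardinality + 1) / 2

    def set_similarity(label_sets):
        m = len(label_sets)
        if m < 2:
            return 1.0                      # single-record convention used in Example 5
        counts = Counter(frozenset(ls) for ls in label_sets)        # (NL_i, count_i)
        total = sum(c * (c - 1) / 2 for c in counts.values())       # C(count_i, 2) terms
        for (nl_i, c_i), (nl_j, c_j) in combinations(counts.items(), 2):
            total += c_i * c_j * similarity(nl_i, nl_j)
        return total / (m * (m - 1) / 2)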
Example 1. Suppose L = {L1, L2, ..., L7}, where L1 = L2 = NL1 = {C1}, L3 = L4 = L5 = NL2 = {C1, C3} and L6 = L7 = NL3 = {C1, C2, C3}. So we have m = 7, count1 = 2, count2 = 3 and count3 = 2. By searching Table 2, we find that similarity(NL1, NL2) = 0.5, similarity(NL1, NL3) = 1/3 and similarity(NL2, NL3) = 2/3. Applying these numbers to Eq. (2), we get

\text{set-similarity}(L) = \frac{1 + 3 + 1 + (2 \times 3 \times 0.5) + (2 \times 2 \times (1/3)) + (3 \times 2 \times (2/3))}{(7 \times 6)/2}

The final result is set-similarity(L) = 0.634.

3.2. The MMC algorithm

The MMC (Multi-valued and Multi-labeled Classifier) algorithm is designed to construct a multi-valued and multi-labeled decision tree. It follows the standard framework adopted by classical classification methods such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), MIND (Wang et al., 1998), IC (Agrawal et al., 1992), SLIQ (Mehta et al., 1996) and SPRINT (Shafer et al., 1996). Here, we explain how MMC works.

1. MMC (Training Set D)
2.   Initialize tree T and put all records of D in the root;
3.   while (some leaf in T is not a STOP node)
4.     for each attribute i of each non-STOP node do
5.       for each split value of attribute i
6.         evaluate how good this splitting of attribute i is;
7.     for each non-STOP leaf do
8.       get the best split for it;
9.     partition the records and grow the tree for one more level according to the best splits;
10.    identify the nodes that can be stopped, mark them as STOP nodes and determine their final labels;
11. return T

In the above framework, two critical points should be further clarified: step 10 and steps 4-6. The former determines which nodes stop growing and how their label-sets are determined. The latter deals with how to split a non-STOP node into several children. In the following, we discuss these two problems.

How to solve step 10? Let C = {C1, C2, ..., Ck} denote all class labels in database D. Let d1, d2, ..., dr be the data records associated with the current node CN, and let L1, L2, ..., Lr be their label-sets, respectively, where Li ⊆ C for all i. Then the support of class label Ci is the percentage of data records in CN that contain class label Ci. If the support of a class label Ci is greater than or equal to the user-specified minimum support (called minsup), we call Ci a large label; otherwise, it is small. Therefore, all the class labels in CN can be classified
into two sets, small(CN) and large(CN), where small(CN) contains all the small labels in CN while large(CN) contains all the large labels in CN. Further, we define the difference of node CN as the smallest support of the labels in large(CN) minus the largest support of the labels in small(CN). Formally,

\text{difference}(CN) = \min\{\text{support}(C_i) \mid C_i \in \text{large}(CN)\} - \max\{\text{support}(C_i) \mid C_i \in \text{small}(CN)\}

If the difference of a node CN is greater than or equal to the user-specified minimum difference (called mindiff), we call CN a clear node; otherwise, it is unclear. For a clear node, we assign all the labels in large(CN) as the label-set of node CN. If one of the following conditions is met, we stop the growing of the node; otherwise, the node has to be expanded further.

1. If node CN is a clear node, then assign all labels in large(CN) as its label-set.
2. If node CN is unclear but all the attributes have been used in the path from the root down to CN, then there are two cases.
   2.1. If large(CN) is not empty, then assign all labels in large(CN) as its label-set.
   2.2. If large(CN) is empty, then assign the label with the maximum support as its label-set.
3. If node CN is unclear but the number of data records is smaller than the user-specified minimum quantity (called minqty), then it is dealt with just as in conditions 2.1 and 2.2.
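The following sketch (ours; the function name, the boolean return convention and the attrs_left flag are illustrative assumptions) pulls together the support computation, the large/small split, the difference test and the three stopping conditions above; Example 2 below gives a numerical check.

    # Stopping test at a node. label_sets is the list of label-sets of the
    # records in CN; minsup and mindiff are fractions in [0, 1].
    from collections import Counter

    def node_labels_and_stop(label_sets, attrs_left, minsup, mindiff, minqty):
        n = len(label_sets)
        counts = Counter(c for ls in label_sets for c in ls)
        support = {c: cnt / n for c, cnt in counts.items()}       # support of each label in CN
        large = {c for c, s in support.items() if s >= minsup}
        small = set(support) - large
        if large:
            # difference(CN) = min support of large labels - max support of small labels
            difference = min(support[c] for c in large) - max((support[c] for c in small), default=0.0)
        else:
            difference = 0.0
        if large and difference >= mindiff:                       # condition 1: clear node
            return large, True
        if not attrs_left or n < minqty:                          # conditions 2 and 3
            if large:                                             # case 2.1
                return large, True
            return {max(support, key=support.get)}, True          # case 2.2
        return None, False                                        # keep growing this node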
Example 2. Suppose node CN has five data records, and their label-sets are {C1, C2}, {C2, C3, C5}, {C3}, {C1, C4} and {C1, C2}, respectively. Then we have support(C1) = 60%, support(C2) = 60%, support(C3) = 40%, support(C4) = 20% and support(C5) = 20%. If we let minsup = 50%, then large(CN) = {C1, C2}, small(CN) = {C3, C4, C5} and difference(CN) = 60% − 40% = 20%. Suppose we set mindiff = 10%. Then, because difference(CN) > mindiff, node CN is a clear node and its label-set is {C1, C2}.

How to solve steps 4-6? For a non-STOP node, we have to select the most discriminatory attribute for splitting and partition its values into intervals. The information gain measure (the entropy-based approach) is probably the most popular approach for doing this. The gain ratio (Agrawal et al., 1992, 1993; Quinlan, 1986, 1993; Umano et al., 1994) and the gini index (Mehta et al., 1996; Shafer et al., 1996; Wang & Zaniolo, 2000) are two famous indices designed with this measure. By computing these indices, we can determine the goodness of an attribute, and the attribute with the best goodness is chosen as the splitting attribute for the current node. Unfortunately, this famous approach cannot be used in our problem. Let us use an example to explain why.
Example 3. Suppose node CN1 has three data records, and their label-sets are {C1, C2, C3}, {C1, C2, C3} and {C1, C2, C3}, respectively. Node CN2 has three data records with label-sets {C1, C2, C4}, {C1, C2, C5} and {C1, C2, C6}. Node CN3 has three data records with label-sets {C1, C2}, {C3, C4} and {C5, C6}. Obviously, node CN1 should be stopped, since all the label-sets in it are the same. And it is clear that node CN2 is better than node CN3, for the label-sets in CN2 are much more similar to one another than those of node CN3. Unfortunately, if we compute the information gain according to the labels, node CN1 will be judged a bad node, as it has probability(C1) = 1/3, probability(C2) = 1/3 and probability(C3) = 1/3. On the other hand, if we compute the information gain according to the label-sets, nodes CN2 and CN3 will be judged equally bad, because in both nodes all the label-sets are entirely different.

The above example indicates that the information gain approach is not suitable for our problem. The reason for its failure is that the information gain approach assumes that all different groups (label-sets in our case) are independent of one another. Unfortunately, this assumption does not fit our problem. For example, the label-sets {C1, C2, C3, C4} and {C1, C2, C3} are different, but they are not independent of each other, for they have three elements in common. Owing to this difficulty, we give up the traditional approach. Instead, we use the similarity function set-similarity(L) discussed in Section 3.1 to select the best splitting attribute and to partition its intervals. In the following, we first discuss how to partition the intervals for an attribute, and then discuss how to select the best splitting attribute.

Let DCN denote the set of data stored in node CN, and let n denote the number of data records in DCN. Suppose that we want to partition it according to a numerical attribute Al. After sorting the data in DCN by attribute Al, the data is ordered as an increasing sequence. Let the jth percentile value of attribute Al in this node be denoted percentile(Al, j). Then we can partition node CN into at most k intervals according to attribute Al in the following way.

1. Split-Intervals(CN, Al, k)
2.   Let b1 denote the smallest value of attribute Al in CN
3.   Let left = b1 − 2Δ  /* Here, Δ denotes a very small number. */
4.   For j = 1 to k
5.     index = (j/k) × 100%
6.     right = percentile(Al, index)
7.     if left = right then skip this iteration
8.     else generate interval [left + Δ, right]; left = right
9.   Endfor

Example 4. Assume that node CN has 10 data records, and the values of attribute age are 15, 15, 15, 15, 17, 18, 18, 25, 28 and 28. Suppose that we want to partition them into five intervals. Then we have b1 = 15, percentile(age, 20%) = 15,
percentile(age, 40%) = 15, percentile(age, 60%) = 18, percentile(age, 80%) = 25 and percentile(age, 100%) = 28. Running the for-loop, we generate the intervals [15 − Δ, 15], [15 + Δ, 18], [18 + Δ, 25] and [25 + Δ, 28]. Here, we generate only four intervals instead of five, for the second iteration of the loop does not generate an interval.

By viewing an interval as a range of values for a numerical attribute or as a single value for a categorical attribute, we can partition DCN into k parts according to the intervals into which the data fall. Let these data sets be denoted DCN(1), DCN(2), ..., DCN(k), let n1, n2, ..., nk denote the numbers of data records in these groups, and let n' = \sum_{i=1}^{k} n_i. (Note that n' ≥ n, for an attribute may have multiple values and a data record can therefore belong to multiple intervals.) For each data set DCN(i), we can compute its similarity by the function set-similarity defined in Section 3.1; let set-similarityi denote this value. Now we are ready to measure the goodness of this splitting, and we define the weighted similarity of the splitting of attribute Al on node CN as

\text{w-similarity}(CN, A_l, k) = \sum_{i=1}^{k} \text{set-similarity}_i \times \frac{n_i}{n'}
In the above formula, if the attribute is categorical, then k is the number of different values of attribute Al in the node. Finally, it is easy to select the best splitting attribute for node CN; the procedure is listed below.

1. Next-attribute(CN)
2.   For each attribute i do
3.     if it is a categorical attribute, compute its weighted similarity
4.     if it is a numerical attribute,
5.       let k = ub and compute its weighted similarity
6.   Endfor
7.   Choose the attribute and split that achieve the largest weighted similarity

Example 5. Table 3 shows the data stored in node CN, which has 10 data records and two classifying attributes: age and hobby.

Table 3
An example with 10 data records and two classifying attributes

Id    | Hobby                  | Age | Labels
Id-1  | Arts, shopping         | 15  | C1, C2
Id-2  | Arts, sports           | 17  | C2, C3
Id-3  | Arts                   | 28  | C3
Id-4  | Shopping, sports       | 15  | C1, C2, C3
Id-5  | Arts, shopping, sports | 15  | C2
Id-6  | Shopping, sports       | 25  | C1
Id-7  | Sports                 | 28  | C2, C3
Id-8  | Arts, shopping         | 18  | C1, C3
Id-9  | Shopping, sports       | 15  | C1, C2, C3
Id-10 | Shopping               | 18  | C2
First, we consider the attribute hobby; node CN can be split into three parts: {id-1, id-2, id-3, id-5, id-8} for arts, {id-1, id-4, id-5, id-6, id-8, id-9, id-10} for shopping and {id-2, id-4, id-5, id-6, id-7, id-9} for sports. For these three sets, we compute their set-similarities, obtaining 0.3 for arts, 0.43 for shopping and 0.47 for sports. Finally, we compute the weighted similarity of attribute hobby as

\text{w-similarity} = \frac{5}{18} \times 0.3 + \frac{7}{18} \times 0.43 + \frac{6}{18} \times 0.47 = 0.407

Next, we consider the attribute age. Since age is a numerical attribute, if we set ub = 5, it can be partitioned into at most five intervals. As demonstrated in Example 4, we finally partition it into four intervals: [15 − Δ, 15], [15 + Δ, 18], [18 + Δ, 25] and [25 + Δ, 28]. These intervals contain the data {id-1, id-4, id-5, id-9} for the first, {id-2, id-8, id-10} for the second, {id-6} for the third and {id-3, id-7} for the last. For these four groups, we compute their set-similarities, obtaining 0.58 for the first, 0.27 for the second, 1 for the third and 0.5 for the last. For the third group, we set the similarity to 1, since it has only one data record. Finally, we compute the weighted similarity of attribute age as

\text{w-similarity} = \frac{4}{10} \times 0.58 + \frac{3}{10} \times 0.27 + \frac{1}{10} \times 1 + \frac{2}{10} \times 0.5 = 0.513
Since the weighted similarity of the attribute age is larger than that of the attribute hobby, we select the attribute age as the next splitting attribute.

Finally, we deal with a performance issue. Some readers may notice that the tree construction algorithm needs to do a lot of sorting. For each internal node CN, we execute sorting O(r) times, where r is the number of numerical attributes. Without sorting, we are not able to partition the data into different groups for the different numerical attributes. When a tree is large, the time spent on sorting could be excessive. To remedy this problem, we adopt the method used by the SPRINT algorithm (Shafer et al., 1996), which avoids costly sorting at each node by presorting the numerical attributes only once, at the beginning of the algorithm. Each numerical attribute is maintained in a sorted attribute list. Each entry in the list contains a value of the attribute, the class labels of the record, and the corresponding record id (rid). The initial lists created from the training set are associated with the root node of the decision tree. As the tree grows and the nodes are split to create new children, the attribute lists belonging to each node are partitioned and associated with the children. When a list is partitioned, the order of records in the list is preserved; thus, a partitioned list never requires resorting.
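To make the splitting procedure concrete, here is a sketch (ours; the dictionary record layout with a "labels" field and the helper names are assumptions for illustration) of percentile-based interval splitting and the weighted-similarity computation, reusing set_similarity from the earlier sketch. Attribute selection then simply picks the attribute whose split yields the largest weighted similarity, as in Next-attribute.

    import math

    def split_intervals(values, k):
        """Return at most k (left, right] bounds over the sorted numeric values."""
        values = sorted(values)
        n = len(values)
        intervals = []
        left = values[0] - 1e-9                    # plays the role of b1 - 2*delta
        for j in range(1, k + 1):
            # the (j/k)*100th percentile, taken as a ceil-indexed order statistic
            right = values[min(n, math.ceil(j * n / k)) - 1]
            if right == left:
                continue                           # skip duplicate boundary (cf. Example 4)
            intervals.append((left, right))
            left = right
        return intervals

    def w_similarity(records, attr, numeric=False, ub=6):
        """Weighted similarity of splitting the node's records on attribute attr."""
        groups = {}
        if numeric:
            for lo, hi in split_intervals([r[attr] for r in records], ub):
                groups[(lo, hi)] = [r["labels"] for r in records if lo < r[attr] <= hi]
        else:
            for r in records:
                vals = r[attr] if isinstance(r[attr], (set, frozenset, list)) else [r[attr]]
                for v in vals:                     # a multi-valued record joins several groups
                    groups.setdefault(v, []).append(r["labels"])
        n_prime = sum(len(g) for g in groups.values())   # n' >= n for multi-valued attributes
        return sum(set_similarity(g) * len(g) / n_prime for g in groups.values())

Run on the records of Table 3, this reproduces the grouping of Example 5 (group sizes 5, 7 and 6 for hobby, and 4, 3, 1 and 2 for age).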
3.3. Predicting the labels for new data

In this section, we discuss two problems: (1) how to determine the labels for a new data record, and (2) how to determine the accuracy of a prediction. In traditional classification, we predict the label of a data record by traversing the tree from the root until we reach a leaf node; we then return the label of this node as the prediction result. However, since an attribute may have multiple values in our problem, we may reach several leaf nodes. So we take the union of all their labels and return it as the prediction result. The following procedure does this.

1. Predict(u)
2.   if u is a leaf node then return the labels of u
3.   result = ∅
4.   For each child v of node u
5.     if the condition of arc (u, v) is satisfied by the data
6.       then result = result ∪ Predict(v)
7.   Endfor
8.   return result
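The same procedure can be rendered in Python as follows (our sketch; the node structure, with a labels field for leaves and (condition, child) pairs for internal nodes, is an assumption for illustration).

    def predict(node, record):
        if not node.branches:                    # leaf node: return its label-set
            return set(node.labels)
        result = set()
        for condition, child in node.branches:
            if condition(record):                # a multi-valued record may satisfy several arcs
                result |= predict(child, record)
        return result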
Finally, when we are given a prediction result, how can we determine its accuracy? In the classical decision tree, a prediction returns only one label and the test data has only one answer, so the accuracy of each prediction is either 1 or 0. In contrast, our prediction model returns a label-set, say Li, as the prediction result, while the testing data may have the real label-set Lj. It would not be fair to score the prediction 1 or 0 when the two sets are similar but not totally the same or totally different. Therefore, we use the similarity measure between two label-sets defined in Section 3.1 as the prediction accuracy.
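A brief sketch of how a single prediction and a whole test run might be scored under this measure (ours; it reuses similarity and predict from the earlier sketches and the dictionary record layout assumed there):

    def prediction_accuracy(predicted, actual):
        return similarity(predicted, actual)     # in [0, 1]; 1 only if the label-sets coincide

    def classification_accuracy(tree_root, test_records):
        # average per-record accuracy, as reported in the experiments of Section 4
        scores = [prediction_accuracy(predict(tree_root, r), r["labels"]) for r in test_records]
        return sum(scores) / len(scores)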
4. Performance evaluation

To study the performance, the algorithm was implemented in C++ and tested on a Pentium 4 2 GHz Microsoft Windows 2000 Server system with 1024 megabytes of main memory. We generate the synthetic data by modifying the well-known synthetic data proposed in Agrawal et al. (1992, 1993), Shafer et al. (1996) and Wang et al. (1998). The synthetic data is a customer database for a tour company in which a person has the nine attributes given in Table 4, where the attributes 'car', 'hobby' and 'occupation' are multi-valued. Using these attributes, we define five groups of functions: class C1, class C2, class C3, class C4 and class C5. If a data record meets any of these five group functions, the record is tagged with the corresponding labels. For example, if a data record meets the functions of group C1 and group C4, then this record will have the labels {C1, C4}. These five classification functions are shown in Fig. 3.
Table 4
Description of attributes

Attribute  | Description         | Number of values             | Value
Salary     | Salary              | 1                            | Uniformly distributed from $20,000 to $150,000
Gender     | Gender              | 1                            | Uniformly chosen from 1 to 2
Age        | Age                 | 1                            | Uniformly chosen from 20 to 80
Car        | Make of the car     | Uniformly chosen from 1 to 3 | Uniformly chosen from 1 to 20
Hobby      | Hobby               | Uniformly chosen from 1 to 5 | Uniformly chosen from 1 to 20
Hvalue     | Value of the houses | 1                            | Uniformly distributed from $50,000 to $150,000
Elevel     | Education level     | 1                            | Uniformly chosen from 1 to 5
m-status   | Marital status      | 1                            | Uniformly chosen from 1 to 3
Occupation | Occupation          | Uniformly chosen from 1 to 2 | Uniformly chosen from 1 to 10
For every experiment, we generate a training set and a test data set. Records in the training set are assigned labels by first generating the record and then applying the classification functions to the record to determine the groups to which it belongs. Labels are also generated for the records in the test set. For each experimental run, the accuracies for all the test records are averaged to obtain the classification accuracy. Throughout the experiments, we use 5000 records as our test data set, but we use training sets of different sizes, including
2000 records, 4000 records, 6000 records, 8000 records and 10,000 records. In addition to the data size parameter, the experiment includes the following parameters.

1. minsup: a label is large if its support reaches minsup.
2. mindiff: a node can be stopped if the difference between the smallest support of the large labels and the largest support of the small labels reaches mindiff.
3. minqty: a node can be stopped if the number of records in it is less than minqty.
4. ub: the upper bound on the number of intervals into which a numerical attribute can be partitioned.
Fig. 3. Classification functions.
In our experiments, we vary the values of these parameters to observe the behavior of the classification algorithm. Unless stated otherwise, we fix the parameters as: training set size = 6000, minsup = 50%, mindiff = 15%, minqty = 6 and ub = 6.

Table 5(1) shows the accuracies, times and numbers of rules for training data sets of different sizes: 2000, 4000, 6000, 8000 and 10,000 records. The results indicate that there is a threshold for the size of the training data set. Once the size of the training data set exceeds this threshold, further expanding it not only increases the tree construction time but also decreases the classification accuracy. The following is our explanation for this phenomenon. When the training data set is still small, adding more data enriches its information. However, when the data set grows beyond a certain size, most of the important information has already been included. From that point on, further expanding the training data set adds much more noise than information. This is why the tree becomes bigger (the number of rules increases) and the classification accuracy declines when the training data set exceeds 8000 records.

Table 5(2) shows the accuracies, times and numbers of rules for different values of minsup: 40, 45, 50, 55 and 60%. First, we explain why the running time decreases as minsup increases. Let us compare 60% and 40%. In the former case, a node will stop splitting if the largest support of the small labels in that node falls in [0, 45%]; in the latter case, a node will stop splitting if this
value is in [0, 25%]. From this simple comparison, we see that the larger the value of minsup is, the greater the chance a node has to stop splitting. Second, we explain the accuracy. Note that when the value of minsup gets larger, the tree stops growing earlier; when it gets smaller, the tree stops growing later. The former implies that some important information may not yet have been included, while the latter implies that much noise may have been included. Therefore, to perform well, a tree must be neither too big nor too small. From Table 5(2), we see that the proper size is achieved if we set the value of minsup to 50 or 55%.

Table 5(3) shows the accuracies, times and numbers of rules for different values of mindiff: 5, 10, 15, 20 and 25%. First, we explain why the running time increases as mindiff increases. Let us compare 5% and 25%. In the former case, a node will stop splitting if the largest support of the small labels in that node falls in [0, 45%]; in the latter case, a node will stop splitting if this value is in [0, 25%]. From this simple comparison, we see that the smaller the value of mindiff is, the greater the chance a node has to stop splitting. Second, we also observe that the accuracy increases as we increase the value of mindiff. The possible reason is that when the value of mindiff gets larger, a node has a sharper boundary between the large labels and the small labels. This makes the labels of a leaf node more similar to one another, and hence improves the accuracy. However, we should not go too far in this direction: when the value of mindiff is too large, a node keeps splitting until the number of data records in it is less than minqty. This results in a very large tree in which each leaf node (rule) applies to only very few data records.
Table 5
Accuracies, times and numbers of rules for different parameters

(1) minsup = 50%, mindiff = 15%, minqty = 6 and ub = 6

Training set size | Accuracy | Time (s) | Number of rules
2000              | 0.603487 | 1019     | 100
4000              | 0.642753 | 1576     | 99
6000              | 0.642493 | 2595     | 137
8000              | 0.639700 | 3899     | 195
10,000            | 0.636080 | 4744     | 212

(2) Training set size = 6000, mindiff = 15%, minqty = 6 and ub = 6

minsup (%) | Accuracy | Time (s) | Number of rules
40         | 0.612210 | 3189     | 270
45         | 0.632383 | 3134     | 234
50         | 0.642493 | 2595     | 137
55         | 0.643367 | 2550     | 130
60         | 0.596300 | 2159     | 69

(3) Training set size = 6000, minsup = 50%, minqty = 6 and ub = 6

mindiff (%) | Accuracy | Time (s) | Number of rules
5           | 0.398570 | 4        | 1
10          | 0.590513 | 1676     | 75
15          | 0.642493 | 2595     | 137
20          | 0.651813 | 2936     | 259
25          | 0.659960 | 3849     | 490

(4) Training set size = 6000, minsup = 50%, mindiff = 15% and ub = 6

minqty | Accuracy | Time (s) | Number of rules
2      | 0.642927 | 2565     | 137
4      | 0.642927 | 2527     | 137
6      | 0.642493 | 2595     | 137
8      | 0.642960 | 2532     | 134
10     | 0.642430 | 2511     | 125

(5) Training set size = 6000, minsup = 50%, mindiff = 15% and minqty = 6

ub | Accuracy | Time (s) | Number of rules
2  | 0.627283 | 1257     | 16
4  | 0.654957 | 1881     | 85
6  | 0.642493 | 2595     | 137
8  | 0.648457 | 3241     | 250
10 | 0.618783 | 4432     | 495
Table 5(4) shows the accuracies, times and numbers of rules for different values of minqty: 2, 4, 6, 8 and 10. In this experiment, we did not see any big difference when we changed the value of minqty, either in accuracy or in running time. The possible reason is that these values of minqty are still not large enough. Theoretically, if we set minqty to a much larger number, nodes will stop growing much earlier; this would greatly shorten the running time but also deteriorate the accuracy.

Table 5(5) shows the accuracies, times and numbers of rules for different values of ub: 2, 4, 6, 8 and 10. The results indicate that the running time and the number of rules increase as we increase the value of ub. This matches our expectation: the larger the value of ub is, the more splitting branches a node has, and therefore the more running time the algorithm spends. Here, we also observe an interesting phenomenon: the best accuracy is achieved when the value of ub is 4. The reason is that when the value of ub is too small, the range of a partitioned interval may be too wide, and many data tuples are put together although they should not be. On the other hand, when the value of ub is too large, the range of a partitioned interval may be too narrow, and many data tuples are separated although they should be put together. That is why the accuracy declines as the value of ub approaches either extreme.
5. Conclusion

Most decision tree classifiers are designed to classify objects whose attributes and class labels are single values. Before this work, there was no algorithm that could classify objects with multi-valued attributes and multiple labels. This paper has therefore proposed an algorithm to build a decision tree classifier for this kind of data.

This work has several possible extensions. One refers to the application of data generalization and specialization in the areas of object-oriented data (Han et al., 1998), the multi-dimensional model in data warehousing (Chaudhuri & Dayal, 1997), and attribute-oriented induction in machine learning (Han et al., 1992). In all these applications, the data values of certain attributes may form different abstraction levels, from the most general concept to the most specific concept. For example, Vancouver, Richmond and West Vancouver can be generalized into the Greater Vancouver Area, which in turn can be generalized into Canada, then into North America, and finally into the globe. In such a case, if we attach concept hierarchies to attributes or class labels, then the data will be hierarchical instead of flat. Approaching such a problem requires building classifiers for multi-valued, multi-labeled and multi-leveled data. The next possible extension is the study of how to prune the tree without losing accuracy. This tree-pruning problem has been studied by many researchers
(Rastogi, 2000). However, since the traditional pruning methods are used to simplify trees designed for single-valued and single-labeled data, further study is needed to find methods suitable for our multi-valued and multi-labeled data. Finally, another possible extension refers to the possibility that an attribute value can be an object of a certain object type. In fact, this is equivalent to classifying object-oriented data. Since object-oriented data has become widespread in various domains such as engineering, business, science, software engineering and information retrieval, it would be a great loss if we were unable to classify such data. Unfortunately, no such classifier exists at present. Hopefully, future researchers will find methods to conquer this problem.
Acknowledgements The research was supported in part by MOE Program for Promoting Academic Excellence of Universities under the Grant Number 91-H-FA07-1-4.
References

Agrawal, R., Ghosh, S., Imielinski, T., Iyer, B., & Swami, A. (1992). An interval classifier for database mining applications. Proceedings of the 18th International Conference on Very Large Databases (pp. 560-573). Vancouver, BC.
Agrawal, R., Imielinski, T., & Swami, A. (1993). Database mining: a performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914-925.
Bramer, M. (2002). Using J-pruning to reduce overfitting in classification trees. Knowledge-Based Systems, 15(5/6), 301-308.
Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26(1), 65-74.
Codd, E. F. (1972). Further normalization of the data base relational model. Data base systems, Courant Computer Science Symposium Series 6, Englewood Cliffs, NJ: Prentice-Hall, pp. 33-64.
Han, J., Cai, Y., & Cercone, N. (1992). Knowledge discovery in databases: an attribute-oriented approach. Proceedings of the International Conference on Very Large Databases (pp. 547-559). Vancouver, Canada.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.
Han, J., Nishio, S., Kawano, H., & Wang, W. (1998). Generalization-based data mining in object-oriented databases using an object-cube model. Data and Knowledge Engineering, 25(1/2), 55-97.
Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. Proceedings of the Fifth International Conference on Extending Database Technology (pp. 18-32). Avignon, France.
Moshkovich, H. M., Mechitov, A. I., & Olson, D. L. (2002). Rule induction in data mining: effect of ordinal scales. Expert Systems with Applications, 22(4), 303-311.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Rastogi, R. (2000). PUBLIC: a decision tree classifier that integrates building and pruning. Data Mining and Knowledge Discovery, 4(4), 315-344.
Ruggieri, S. (2002). Efficient C4.5. IEEE Transactions on Knowledge and Data Engineering, 14(2), 438-444.
Shafer, J. C., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. Proceedings of the 22nd International Conference on Very Large Databases (pp. 544-555). Mumbai (Bombay), India.
Steinberg, D., & Colla, P. L. (1995). CART: Tree-structured nonparametric data analysis. San Diego, CA: Salford Systems.
Umano, M., Okamoto, H., Hatono, I., Tamura, H., Kawachi, F., Umedzu, S., & Kinoshita, J. (1994). Fuzzy decision trees by fuzzy ID3 algorithm and its application to diagnosis systems. Proceedings of the Third IEEE International Conference on Fuzzy Systems, Vol. 3 (pp. 2113-2118). Orlando, FL.
Wang, M., Iyer, B., & Vitter, J. S. (1998). Scalable mining for classification rules in relational databases. Proceedings of the International Database Engineering and Applications Symposium (pp. 58-67). Cardiff, Wales, UK.
Wang, H., & Zaniolo, C. (2000). CMP: A fast decision tree classifier using multivariate predictions. Proceedings of the 16th International Conference on Data Engineering (pp. 449-460).