European Journal of Operational Research 174 (2006) 930–944 www.elsevier.com/locate/ejor
Computing, Artificial Intelligence and Information Management
TASC: Two-attribute-set clustering through decision tree construction

Yen-Liang Chen *, Wu-Hsien Hsu, Yu-Hsuan Lee

Department of Information Management, National Central University, Chung-Li, 320 Taiwan, ROC

Received 30 October 2003; accepted 14 April 2005
Available online 27 June 2005
Abstract

Clustering is the process of grouping a set of objects into classes of similar objects. In the past, clustering algorithms have had a common problem: they use only one set of attributes both for partitioning the data space and for measuring the similarity between objects. This problem has limited the use of the existing algorithms in some practical situations. Hence, this paper introduces a new clustering algorithm, which partitions the data space by constructing a decision tree using one attribute set, and measures the degree of similarity using another. Three different partitioning methods are presented, and the algorithm is explained with an illustrative example. The performance and accuracy of the three partitioning methods are evaluated and compared.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Data mining; Clustering; Decision tree
1. Introduction

Clustering is the process of grouping a set of objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. Due to its widespread usage in many applications, many clustering techniques have been developed (Ankerst et al., 1999; Basak and Krishnapuram, 2005; Bezdek, 1981; Chen et al., 2003; Friedman and Fisher, 1999; Grabmeier and Rudolph, 2002; Guha et al., 1998; Keim and Hinneburg, 1999; Jain et al., 1999; Kantardzic, 2002; Karypis et al., 1999; Klawonn and Kruse, 1997; Liu et al., 2000; Yao, 1998). In Han and Kamber (2001), the authors classify existing clustering techniques into five major categories:
* Corresponding author. Tel.: +886 3 4267266; fax: +886 3 4254604.
E-mail address: [email protected] (Y.-L. Chen).
[Decision tree: the root splits on X1 < 2 versus X1 ≥ 2, and the second node splits on X2 < 5 versus X2 ≥ 5; the first two nodes are labeled Y and the X2 ≥ 5 leaf is labeled N, giving clusters C1: X1 < 2 and C2: X1 ≥ 2 ∧ X2 < 5.]
Fig. 1. An example of clustering through decision tree construction.
(A) partitioning-based clustering algorithms, such as K-means, K-medoids, and CLARANS;
(B) hierarchical-based clustering algorithms, which include the agglomerative approach (CURE and Chameleon) and the divisive approach (BIRCH);
(C) density-based algorithms, such as DBSCAN and OPTICS;
(D) grid-based algorithms, such as STING and CLIQUE;
(E) model-based algorithms, such as COBWEB.

All the techniques mentioned above share a common limitation: only one set of attributes is considered both for partitioning the data space and for measuring the similarity between objects. They therefore cannot be applied to practical situations in which two sets of attributes are required to accomplish the job (the two-attribute-set problem). Consider the following scenario: sales departments often need to cluster their customers so that different promotion strategies can be applied accordingly. Some promotion strategies are designed for certain groups of customers according to their consumption behaviors (the first attribute set), e.g., average expenditure, frequency of consumption, etc. Hence, customers should be clustered by the attribute set of consumption behaviors. On the other hand, in order to apply a promotion strategy to suitable customers, the department might need to know the characteristics of every single cluster in terms of the customers' personal information (the second attribute set), e.g., age, gender, income, occupation, and education. Furthermore, the department may want to use the patterns of the second attribute set to select potential customers before any consumption information is available, and apply the corresponding promotions to those customers. This explains the requirement of using two sets of attributes for the dataset-partitioning task and the similarity-measuring task.

We adopt the concept of decision tree construction to solve the two-attribute-set problem in clustering analysis. In Fig. 1, the first and the second nodes are labeled Y, meaning that the number of objects with label Y is larger than that with label N (a dense space); the third node is labeled N, meaning that the number of objects with label N is larger than that with label Y (a sparse space). Thus, the clustering produces two nodes with label Y. The feature of cluster 1 is X1 < 2, and that of cluster 2 is X1 ≥ 2 and X2 < 5.

In this paper, we present a new clustering algorithm, TASC, which allows different attribute sets for the dataset-partitioning task (tree construction) and the clustering task (similarity measuring), giving more flexibility to applications. The paper is divided into five sections. Section 2 gives the description and definition of the problem. Section 3 introduces the clustering algorithm. Experimental results are presented in Section 4, and we conclude the research and discuss future perspectives in Section 5.
2. Problem statement and definitions

We are given a dataset X and an attribute set P = {P1, P2, . . ., Pr}, where P1, P2, . . ., Pr are all numerical attributes of X. The classifying attribute set A and the clustering attribute set C are two subsets of P with the following relationship:

$A \cap C = \{P_1, P_2, \ldots, P_s\}$,

where $P_1, P_2, \ldots, P_s$ are arbitrary attributes of P; that is, the two sets may overlap in any way.
Table 1
A sample dataset

ID    Age  Income  Average amount  Frequency
001   20   10      2               6
002   23   30      5               6
003   31   45      1               3
004   36   100     2               2
005   42   200     10              10
006   44   100     2               8
007   48   130     3               7
The goal of the algorithm is to construct a decision tree, with the attributes in A as non-leaf nodes, to segment the data space. Each leaf node represents a cluster in which all records have satisfactory similarity as measured by the attributes in C. To aid understanding, we use Table 1 to illustrate the following definitions.

Definition 1. A is a subset of P. A = {A1, A2, . . ., Am} is a classifying attribute set if every Ai in A is used for constructing the decision tree. Each Ai is called a classifying attribute. Ex: A = {A1, A2}, where A1: age and A2: income.

Definition 2. C is a subset of P. C = {C1, C2, . . ., Cn} is a clustering attribute set if every Ci in C is used for measuring the similarity between records in S. Each Ci is called a clustering attribute. Ex: C = {C1, C2}, where C1: average amount and C2: frequency.

Definition 3. For each Ai and Ci, the value of record x is denoted as Ai(x) and Ci(x), respectively. Ex: C1(003) = 1, C2(003) = 3, and A1(003) = 31.

Definition 4. If we partition an attribute Pi into k intervals, the jth interval of Pi is denoted as $P_i^j$. For each Ai, the jth interval of the attribute is denoted as $A_i^j$. Ex: $A_1^1$: age < 30; $A_1^2$: 30 ≤ age < 40; and $A_1^3$: age ≥ 40.
Definition 5. If we use Ai to partition node S into k sub-nodes, then we have $S = \{s_i^1, s_i^2, \ldots, s_i^k\}$, where $s_i^j = \{x \mid x \in S,\ A_i(x) \in A_i^j\}$ for 1 ≤ j ≤ k, and $s_i^y \cap s_i^z = \emptyset$ for 1 ≤ y ≤ k, 1 ≤ z ≤ k and y ≠ z. Ex: From Table 1, let S = {001, 002, 003, 004, 005, 006, 007}. If we use A1 to classify node S into three intervals with $A_1^1$: age < 30, $A_1^2$: 30 ≤ age < 40 and $A_1^3$: age ≥ 40, then we have $s_1^1$ = {001, 002}, $s_1^2$ = {003, 004} and $s_1^3$ = {005, 006, 007}.
Definition 6. The space volume of node S.
(a) Let Max(Ci(S)) denote the maximum value of Ci over all the records in S, and similarly Min(Ci(S)) the minimum value of Ci, where 1 ≤ i ≤ n. Ex: For C1 and S = {001, 002, 003, 004, 005, 006, 007}, we have Max(C1(S)) = 10 and Min(C1(S)) = 1, where the maximum value occurs at record 005 and the minimum at record 003.
(b) Diff(Ci(S)) = Max(Ci(S)) − Min(Ci(S)). Ex: Diff(C1(S)) = 10 − 1 = 9 and Diff(C2(S)) = 10 − 2 = 8.
(c) The space volume of node S, denoted P(S), is defined as $P(S) = \prod_{i=1}^{n} \mathrm{Diff}(C_i(S))$. Ex: P(S) = 9 × 8 = 72.

Definition 7. The volume of a degenerate space. Attribute Ch is called degenerate in node $s_i^j$ if $\mathrm{Diff}(C_h(s_i^j)) = 0$. In this case, we define $P(s_i^j) = \prod_{y=1}^{h-1} \mathrm{Diff}(C_y(s_i^j)) \prod_{y=h+1}^{n} \mathrm{Diff}(C_y(s_i^j))$. Ex: For $s_1^1$ = {001, 002}, we have Max(C2($s_1^1$)) = 6 and Min(C2($s_1^1$)) = 6. Thus, we have P($s_1^1$) = Diff(C1($s_1^1$)) = 3.

Definition 8. The density of a node. The density of node S, D(S), is defined as |S|/P(S). Ex: D(S) = 7/72.

Definition 9. The extreme values of classifying attributes in a node. Let Max(Ai($s_i^j$)) denote the maximum value of Ai over all the records in $s_i^j$, and similarly Min(Ai($s_i^j$)) the minimum value of Ai, where 1 ≤ i ≤ m. Ex: For A1 and $s_1^3$ = {005, 006, 007}, we have Max(A1($s_1^3$)) = 48 and Min(A1($s_1^3$)) = 42.

Definition 10. The expected number of records in a node. Let $n_i^j$ denote the expected number of records in node $s_i^j$ if node $s_i^j$ had the same density as node S. Then $n_i^j = D(S) \times P(s_i^j)$. Ex: Since D(S) = 7/72 and P($s_1^3$) = 24, we have $n_1^3 = 7/72 \times 24 = 7/3 \approx 2.33$.

Definition 11. Sparse nodes and dense nodes.
(a) Node S is sparse, and will be considered as noise and removed from the dataset, if: (1) D(S) < a × D(R), where a is a user-defined factor for the threshold of minimum density of a node and R is the parent node of S, i.e., the density of node S is too low; (2) |S| < c × |X|, where c is a user-defined factor for the threshold of minimum number of records in a node, i.e., there are too few records in node S.
(b) Node S is dense if D(S) is greater than b × D(R), where b is the factor for the threshold density of a dense node and b > 1. A dense node becomes a leaf node without further processing.

Definition 12. Virtual nodes. A virtual node $u_i^{x,y}$ (1 ≤ x < k, 1 < y ≤ k, x < y) is not actually created but is used to denote the node obtained by combining the real nodes $s_i^x, s_i^{x+1}, \ldots, s_i^{y-1}$ and $s_i^y$. For the virtual node $u_i^{x,y}$, we need to compute three of its properties, described as follows.
(a) The number of records in the node: $|u_i^{x,y}| = \sum_{z=x}^{y} |s_i^z|$. Ex: $u_1^{2,3}$ is a virtual node obtained by combining $s_1^2$ and $s_1^3$. The example in Definition 5 sets $s_1^2$ = {003, 004} and $s_1^3$ = {005, 006, 007}; so we have $|s_1^2| = 2$, $|s_1^3| = 3$ and $|u_1^{2,3}| = 2 + 3 = 5$.
(b) The volume of the node: $P(u_i^{x,y}) = \prod_{q=1}^{n} (\phi_q - \pi_q)$, where $\phi_q = \max_{w}\{\mathrm{Max}(C_q(s_i^w))\}$ for x ≤ w ≤ y and $\pi_q = \min_{w}\{\mathrm{Min}(C_q(s_i^w))\}$ for x ≤ w ≤ y. Ex: Because $s_1^2$ = {003, 004} and $s_1^3$ = {005, 006, 007}, we have Max(C1($s_1^2$)) = 2, Max(C1($s_1^3$)) = 10, Min(C1($s_1^2$)) = 1, Min(C1($s_1^3$)) = 2, Max(C2($s_1^2$)) = 3, Max(C2($s_1^3$)) = 10, Min(C2($s_1^2$)) = 2 and Min(C2($s_1^3$)) = 7. Accordingly, we obtain $\phi_1$ = max(2, 10) = 10, $\pi_1$ = min(1, 2) = 1, $\phi_2$ = max(3, 10) = 10 and $\pi_2$ = min(2, 7) = 2. Finally, $P(u_1^{2,3}) = (\phi_1 - \pi_1)(\phi_2 - \pi_2) = (10 - 1)(10 - 2) = 72$.
(c) The density of the node: $D(u_i^{x,y}) = |u_i^{x,y}| / P(u_i^{x,y})$.
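To make the definitions concrete, the short sketch below recomputes the running example on the Table 1 data. The record IDs and dictionary keys ("avg" for average amount, "freq" for frequency) are our own naming; the printed values match the worked examples above (P(S) = 72, D(S) = 7/72, P($s_1^1$) = 3, P($s_1^3$) = 24, $n_1^3 \approx 2.33$, and $|u_1^{2,3}| = 5$ with P($u_1^{2,3}$) = 72).

```python
TABLE1 = {
    "001": {"age": 20, "income": 10,  "avg": 2,  "freq": 6},
    "002": {"age": 23, "income": 30,  "avg": 5,  "freq": 6},
    "003": {"age": 31, "income": 45,  "avg": 1,  "freq": 3},
    "004": {"age": 36, "income": 100, "avg": 2,  "freq": 2},
    "005": {"age": 42, "income": 200, "avg": 10, "freq": 10},
    "006": {"age": 44, "income": 100, "avg": 2,  "freq": 8},
    "007": {"age": 48, "income": 130, "avg": 3,  "freq": 7},
}
C = ["avg", "freq"]                         # clustering attributes C1 and C2

def diff(ids, attr):                        # Diff(Ci(S)) of Definition 6(b)
    vals = [TABLE1[i][attr] for i in ids]
    return max(vals) - min(vals)

def volume(ids):                            # P(S), skipping degenerate attributes (Defs. 6(c) and 7)
    p = 1
    for c in C:
        if diff(ids, c) > 0:
            p *= diff(ids, c)
    return p

def density(ids):                           # D(S) = |S| / P(S) (Definition 8)
    return len(ids) / volume(ids)

S = list(TABLE1)                                        # the whole node S
s11 = [i for i in S if TABLE1[i]["age"] < 30]           # s_1^1 = {001, 002}
s12 = [i for i in S if 30 <= TABLE1[i]["age"] < 40]     # s_1^2 = {003, 004}
s13 = [i for i in S if TABLE1[i]["age"] >= 40]          # s_1^3 = {005, 006, 007}

print(volume(S), density(S))           # 72 and 7/72 (Definitions 6 and 8)
print(volume(s11), volume(s13))        # 3 (freq is degenerate, Def. 7) and 24 (used in Def. 10)
print(density(S) * volume(s13))        # expected count n_1^3 = 7/3, about 2.33 (Definition 10)

u23 = s12 + s13                        # virtual node u_1^{2,3} (Definition 12)
print(len(u23), volume(u23), density(u23))   # 5, 72 and 5/72; combined max/min give the same phi/pi
```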
3. The clustering algorithm

TASC is basically a decision-tree-construction algorithm. It employs a top-down, divide-and-conquer strategy to construct a decision tree. Fig. 2 shows the pseudo code of TASC. First, node S is examined. If S is already a dense node (Definition 11(b)), it is marked as a leaf node and the process ends. Otherwise, node S needs to be partitioned. Lines 6–8 tentatively use each classifying attribute to partition S and calculate the fitness of each attribute by measuring its diversity or entropy. Lines 10–14 then find the candidate with the highest fitness among those attributes that produce at least one sub-node with a density higher than that of S. Lines 15–16 check whether any candidate attribute is found and mark S as a leaf node if none is found. Having determined the candidate classifying attribute, Lines 18–20 construct new sub-nodes for all $s_i^j$ in S that are not sparse (Definition 11(a)), and Build_Tree is called recursively for each new sub-node.

The algorithm adopts two measures, entropy and diversity, for calculating the fitness degree used to find the most suitable partitioning attribute. The two measures are discussed in Section 3.1. Three partitioning methods, minimum entropy partitioning (MEP), equal-width binary partitioning (EWP) and equal-depth binary partitioning (EDP), are explained in Sections 3.2–3.4, respectively. The entropy measure is used with MEP, while the diversity measure is used with the latter two methods, EWP and EDP.

3.1. Two measures of fitness

3.1.1. Entropy measure

Suppose that the records in node S are marked with label 'Y', and suppose that there are |S| virtual records with label 'N' spread uniformly across the space region of S. If we choose a certain cutting point on the classifying attribute Ai, the original space is partitioned into two sub-spaces, each of which may contain some records labeled Y and some labeled N. The number of records with label Y in the sub-space $P(s_i^j)$ is $|s_i^j|$. As for the virtual records with label N, since they are spread uniformly across the space of S, the number of records with label N in $P(s_i^j)$ is $D(S) \times P(s_i^j)$. With these values, we calculate the entropy value for a given splitting point on a certain attribute. Finally, the cutting point with the lowest entropy value among all attributes is selected to partition the current node. Since the entropy theory is widely known, we skip its details; readers may refer to Quinlan (1993, 1996) and Ruggieri (2002).
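As a concrete illustration (the paper defers the exact formulation to Quinlan, so the weighting below is our assumption, chosen to be consistent with the description above), the entropy of a candidate binary split of S on attribute $A_i$ can be written with $y_j = |s_i^j|$ actual Y records and $n_j = D(S)\,P(s_i^j)$ virtual N records in each sub-space:

$$E(s_i^j) = -\frac{y_j}{y_j + n_j}\log_2\frac{y_j}{y_j + n_j} - \frac{n_j}{y_j + n_j}\log_2\frac{n_j}{y_j + n_j},\qquad
E_{\mathrm{split}} = \sum_{j=1}^{2}\frac{y_j + n_j}{\sum_{l=1}^{2}(y_l + n_l)}\,E(s_i^j),$$

and the candidate cutting point with the smallest $E_{\mathrm{split}}$ is preferred.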
Fig. 2. The pseudo code of the algorithm.
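Fig. 2 gives the pseudo code only as an image, so the following is a minimal runnable sketch of the Build_Tree procedure as described above, under stated assumptions: records are dictionaries, A and C hold the names of the classifying and clustering attributes, a simple equal-depth (median) binary cut stands in for the MEP/EWP/EDP partitioning methods, the diversity measure of Section 3.1.2 is used as the fitness, and the two sparseness conditions of Definition 11(a) are treated as alternatives. None of the names below come from the paper.

```python
from math import prod


def volume(S, C):
    """Space volume P(S): product of Diff(Ci(S)) over non-degenerate Ci (Definitions 6-7)."""
    diffs = [max(r[c] for r in S) - min(r[c] for r in S) for c in C]
    return float(prod(d for d in diffs if d > 0))


def density(S, C):
    """D(S) = |S| / P(S) (Definition 8)."""
    return len(S) / volume(S, C)


def diversity(S, subsets, C):
    """Sum over sub-nodes of ||s_i^j| - n_i^j| * |s_i^j| / |S| (Section 3.1.2)."""
    d_S = density(S, C)
    return sum(abs(len(s) - d_S * volume(s, C)) * len(s) / len(S) for s in subsets if s)


def median_split(S, a):
    """Equal-depth (median) binary cut on attribute a; a stand-in for MEP/EWP/EDP."""
    vals = sorted(r[a] for r in S)
    cut = (vals[(len(vals) - 1) // 2] + vals[len(vals) // 2]) / 2.0
    return [r for r in S if r[a] < cut], [r for r in S if r[a] >= cut]


def build_tree(S, A, C, n_total, alpha, beta, gamma, parent_density):
    """Grow the tree rooted at node S (a list of record dictionaries)."""
    node = {"records": S, "children": []}
    if density(S, C) > beta * parent_density:        # dense node becomes a leaf (Def. 11(b))
        return node
    best = None                                      # (fitness, subsets) of the best attribute
    for a in A:                                      # tentatively split on every classifying attribute
        subsets = median_split(S, a)
        # keep only attributes producing at least one sub-node denser than S
        if any(s and density(s, C) > density(S, C) for s in subsets):
            f = diversity(S, subsets, C)
            if best is None or f > best[0]:
                best = (f, subsets)
    if best is None:                                 # no useful split: S becomes a leaf
        return node
    for s in best[1]:
        # discard sparse sub-nodes (Definition 11(a), conditions treated as alternatives here)
        if not s or density(s, C) < alpha * density(S, C) or len(s) < gamma * n_total:
            continue
        node["children"].append(
            build_tree(s, A, C, n_total, alpha, beta, gamma, density(S, C)))
    return node
```

A call such as build_tree(records, ["age", "income"], ["avg", "freq"], len(records), alpha, beta, gamma, density(records, ["avg", "freq"])) would grow the tree top-down; plugging the real MEP, EWP or EDP split in place of median_split recovers the three variants described below.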
3.1.2. Diversity measure

When a node S is cut into several sub-nodes, the densities of some sub-nodes will be higher than that of node S, while others will be lower. The strategy is to find the dense regions, which are likely to become clusters, and to filter out the sparse regions from consideration. The density of node S is denoted D(S). After partitioning by Ai, the jth sub-node of S has the following quantities: $n_i^j = D(S) \times P(s_i^j)$ is the expected number of records and $|s_i^j|$ is the actual number of records. According to the above strategy, the larger the deviation $\left||s_i^j| - n_i^j\right|$ becomes, the better the result. Furthermore, since each branch contains a different number of records, we multiply the deviation by the weight $w_i^j = |s_i^j|/|S|$. The sum of the weighted deviations over all sub-nodes is defined as the diversity. Therefore, the diversity is $\sum_{j=1}^{k} \left||s_i^j| - n_i^j\right| \times |s_i^j|/|S|$.

3.2. Minimum entropy partitioning (MEP)

Since all given attributes are numerical, we choose the best partitioning point from the mid-points of all pairs of successive adjacent values. The mid-point with the smallest entropy value becomes the real cutting point of the classifying attribute (Quinlan, 1996). In this study, for the sake of efficiency, we employ the grid-based method introduced in Cheng et al. (1999) and Agrawal et al. (1999): the classifying attribute is partitioned into a fixed number of intervals of the same length, and the entropy value of each candidate cutting point is calculated.
Lastly, the point with the smallest entropy value is taken as the real cutting point of the classifying attribute. In this method, the larger the value range of the classifying attribute is, the more cutting points have to be evaluated. This method produces two sub-nodes: the first sub-node contains the records of S whose values are smaller than the cutting point, while the second sub-node contains the records of S whose values are no less than the cutting point.

3.3. Equal-width binary partitioning (EWP)

EWP consists of two stages: a partitioning stage and a merging stage. In the first stage, we repeatedly partition each interval into two equal-length, non-overlapping intervals until no further partition is possible. The second stage then repeatedly merges pairs of adjacent intervals whenever the newly merged interval yields a better result, and stops when no further improvement is possible. Details of the method are explained in the following paragraphs.

3.3.1. Stage 1: Partitioning stage

Suppose we use the classifying attribute Ai to partition node S into two sub-nodes, $s_i^1$ and $s_i^2$. If the density of a sub-node increases by more than d%, there are two possibilities: (1) if the sub-node is a dense node, we stop expanding it, for it becomes a leaf node; (2) otherwise, we further partition the interval associated with the sub-node, in an attempt to find dense nodes among its descendants. As shown in Fig. 3, the classifying attribute Ai is first employed to cut node S into two sub-nodes $s_i^1$ and $s_i^2$. If the increase in the density of $s_i^1$ is higher than d% but the leaf-node condition has not been reached, $s_i^1$ becomes a new node and Binary_Cut is called recursively to cut $s_i^1$. Likewise, the same procedure is applied to $s_i^2$. The process is executed recursively until the leaf-node condition is reached.
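Since Binary_Cut itself is shown only in Fig. 3, the following is a minimal sketch of the partitioning stage under stated assumptions: density is any implementation of Definition 8 over the clustering attributes C (such as the helper in the Build_Tree sketch above), is_dense is a predicate for the leaf-node condition of Definition 11(b), and delta is the threshold d expressed as a fraction (d/100).

```python
def binary_cut(S, a, C, delta, density, is_dense):
    """Recursively cut node S on classifying attribute a into equal-width halves
    and return the resulting list of sub-nodes (each a list of records)."""
    lo = min(r[a] for r in S)
    hi = max(r[a] for r in S)
    if lo == hi:                                       # nothing left to cut on this attribute
        return [S]
    cut = (lo + hi) / 2.0                              # equal-width cut point
    halves = ([r for r in S if r[a] < cut], [r for r in S if r[a] >= cut])
    result = []
    for part in halves:
        gain = density(part, C) / density(S, C) - 1.0  # relative increase in density
        if gain > delta and not is_dense(part):        # denser than S but not yet a leaf: keep cutting
            result.extend(binary_cut(part, a, C, delta, density, is_dense))
        else:
            result.append(part)
    return result
```

The merging stage of Section 3.3.2 is then applied to the returned list of adjacent sub-nodes.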
Fig. 3. Partitioning stage of EWP method.
3.3.2. Stage 2: Merging stage

The second stage improves the outcome of the first stage by merging pairs of adjacent intervals whenever the merged pair has a higher density. The merging process is as follows:
(A) For every two adjacent nodes, calculate the density of their combination.
(B) Find those combinations whose densities are higher than both of their constituent nodes.
(C) Select the combination with the highest density, and generate the node Y by combining its constituent nodes.
(D) For each node X adjacent to Y, calculate the density of the combination of X and Y. Note that there are at most two nodes adjacent to Y.
(E) Repeat the loop from (B) to (D) until no possible combination is found.

Let us use Fig. 4 as an example. Ai is used to cut the interval into seven sub-intervals and thus seven sub-nodes. Suppose that, according to Definition 12, we generate the six virtual nodes $u_i^{1,2}, u_i^{2,3}, \ldots, u_i^{6,7}$, where $u_i^{a,b}$ is the virtual node obtained by combining the intervals from the ath interval to the bth interval. Assume that the density of the virtual node $u_i^{2,3}$ is higher than that of either $s_i^2$ or $s_i^3$, and that the density of $u_i^{4,5}$ is higher than that of either $s_i^4$ or $s_i^5$. Then we have two combinations that can be considered for merging; in Fig. 4, these two combinations are marked with "H". Further assume that the density of $u_i^{2,3}$ is higher than that of $u_i^{4,5}$. We therefore actually merge the second and the third intervals, which means that $u_i^{2,3}$ now becomes an actual node. Since $u_i^{2,3}$ has two adjacent intervals, the first interval and the fourth interval, we need to re-compute their combinations: the combination with the first interval produces the virtual node $u_i^{1,3}$, and the combination with the fourth interval produces $u_i^{2,4}$. Fig. 5 shows the situation after the actual combination, where "J" indicates that these two intervals have been merged. A minimal code sketch of this merging procedure is given after Fig. 5.
Fig. 4. $u_i^{2,3}$ and $u_i^{4,5}$ can be considered for merging.
Fig. 5. After merging $u_i^{2,3}$.
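The following is a minimal sketch of the merging stage, steps (A)–(E) above. It assumes that each sub-node is simply a list of records ordered by adjacent intervals and that density implements Definition 8 over the clustering attributes C (e.g., the helper from the Build_Tree sketch). For simplicity it recomputes every adjacent pair in each pass instead of only the neighbours of the newly merged node as in step (D); the outcome is the same, at slightly more cost.

```python
def merge_adjacent(nodes, C, density):
    nodes = [n for n in nodes if n]          # drop empty sub-nodes
    while True:
        best = None                          # (combined density, index of the left node)
        for i in range(len(nodes) - 1):      # (A): density of each adjacent combination
            combo = nodes[i] + nodes[i + 1]
            d = density(combo, C)
            # (B): keep only combinations denser than both of their constituents
            if d > density(nodes[i], C) and d > density(nodes[i + 1], C):
                if best is None or d > best[0]:
                    best = (d, i)
        if best is None:                     # (E): no combination improves the density
            return nodes
        i = best[1]                          # (C): merge the densest qualifying pair
        nodes[i:i + 2] = [nodes[i] + nodes[i + 1]]
```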
3.4. Equal-depth binary partitioning (EDP)

The process of EDP is similar to that of EWP: the two methods use the same measure and the same partitioning framework. The only difference lies in how an interval is partitioned into two sub-intervals. EWP partitions an interval into two sub-intervals of the same length, while EDP partitions it into two sub-intervals containing the same number of data records. For example, suppose node S has eight records whose values on attribute Ai are 1, 2, 4, 4, 5, 6, 10, and 12. EWP takes the average of the lowest and highest values as the partition point (in this example, 6.5), while EDP takes the median value as the partition point (in this example, 4.5). Since the two methods are otherwise similar, we omit the details.
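The two cut points of this example can be computed directly; the small snippet below (variable names are ours) simply restates the two rules.

```python
values = [1, 2, 4, 4, 5, 6, 10, 12]          # values of the eight records on Ai

ewp_cut = (min(values) + max(values)) / 2    # equal-width rule: midpoint of the range -> 6.5

srt = sorted(values)
edp_cut = (srt[len(srt) // 2 - 1] + srt[len(srt) // 2]) / 2   # equal-depth rule: median -> 4.5
```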
4. Experiments

Although the goal of a clustering process is to maximize the similarity within a cluster and the dissimilarity between clusters, the expected and obtained results often differ from one another, depending on the attributes selected. In this study, we use two attribute sets, one for partitioning the dataset and one for calculating the similarity. Thus, the goal is not only to maintain good similarity within a cluster but also to maximize the accuracy obtained by the decision tree. Three programs, MEP, EWP, and EDP, are tested on a Celeron 1.7G Windows-2000 system with 768 MB of main memory and JVM (J2RE 1.3.1-b24) as the Java execution environment. The evaluation of efficiency and accuracy is done with synthetic data sets and reported in Sections 4.1 and 4.2 (see Appendix A for details of the generation of the synthetic data). A decision tree created from a real data set is presented in Section 4.3, and the experiments are discussed in Section 4.4.

4.1. Efficiency evaluation

In this experiment, we set m = 8, n = 8, mc = 6, nc = 6, k = 8, c = 5 but leave |X| and dis as variants, in order to compare the runtime of the three methods for different numbers of records. In both the normal and uniform distribution cases, as the number of data records increases, the runtime of the EDP and EWP methods increases only slightly, but that of MEP increases significantly. The results are shown in Figs. 6 and 7.
[Line chart: run time in seconds (0–8000) versus number of records (100,000–500,000) for EWP, EDP and MEP.]
Fig. 6. Normal distribution: run time vs. number of records.
[Line chart: run time in seconds (0–7000) versus number of records (100,000–500,000) for EWP, EDP and MEP.]
Fig. 7. Uniform distribution: run time vs. number of records.
4.2. Accuracy evaluation

In this section, we compare the accuracy of the three methods on different datasets. To measure their performance, we define the accuracy as follows: after the clusters are formed, we calculate the distance between each data object and its assigned cluster, and a data object is considered correctly clustered if this distance is the minimum over all clusters (i.e., its assigned cluster is the nearest one). The accuracy rate is the ratio of the number of correctly clustered data objects to the total number of data objects.

From the experiments, we found that the accuracy is independent of the size of the dataset. For normal distribution datasets, as seen in Fig. 8, MEP achieves the best accuracy of around 0.8, while the other two methods are similar to each other, varying from 0.45 to 0.65. Fig. 9 shows the results for the uniform distribution case; there, all three methods maintain acceptable accuracy as the size of the dataset grows. On the other hand, owing to the nature of the two-attribute-set problem, the dependency between the two attribute sets is also an important factor affecting the obtained accuracy. This dependency varies when different datasets are used or different attribute sets are selected. A hint for attribute selection is given in Section 4.4.2.
[Line chart: accuracy rate (0–1) versus number of records (100,000–500,000) for EWP, EDP and MEP.]
Fig. 8. Normal distribution: accuracy vs. number of records.
[Line chart: accuracy rate (0–1) versus number of records (100,000–500,000) for EWP, EDP and MEP.]
Fig. 9. Uniform distribution: accuracy vs. number of records.
4.3. The TASC tree built from a real dataset

In order to give a lucid example, we use the hitter file obtained from http://lib.stat.cmu.edu/datasets/baseball.data to build a TASC tree. The hitter file consists of data on the regular and leading substitute hitters in 1986 and 1987, and contains 24 attributes and 322 data objects. Among the 24 attributes, 7 non-numerical attributes are removed from our experimental data due to the limitation of our algorithm, namely, the hitter's name, league, division, team, position, etc. Also, 61 data objects with missing values are deleted from the data file. The final dataset thus contains 17 attributes and 261 data objects. In the experiment, we take 6 attributes for classification and all 17 attributes as clustering attributes; the attributes are described in Table 2. We use EWP to construct this TASC tree, with the four parameters set as a = 0.001, b = 40, c = 0.04, and d = 0.3. As a result, five clusters are found, with an accuracy of 0.69. The TASC tree is shown in Fig. 10.

Table 2
Dataset attribute description

Attribute description                          Classification attribute    Clustering attribute
Number of times at bat in 1986                 A1                          C1
Number of hits in 1986                         A2                          C2
Number of home runs in 1986                    A3                          C3
Number of runs in 1986                         A4                          C4
Number of runs batted in 1986                  A5                          C5
Number of walks in 1986                        A6                          C6
Number of years in the major leagues           –                           C7
Number of times at bat during his career       –                           C8
Number of hits during his career               –                           C9
Number of home runs during his career          –                           C10
Number of runs during his career               –                           C11
Number of runs batted during his career        –                           C12
Number of walks during his career              –                           C13
Number of put outs in 1986                     –                           C14
Number of assists in 1986                      –                           C15
Number of errors in 1986                       –                           C16
1987 Annual salary on opening day in USD       –                           C17
[TASC tree: the root splits on A1 into [19, 351] and [354, 687]; deeper splits use A5, A4, A2 and A3 with intervals such as [0, 31], [32, 64], [14, 20], [21, 27], [28, 55], [43, 69] and [0, 6]; the five leaf clusters contain 11, 18, 13, 43 and 157 records, respectively.]
Fig. 10. A TASC tree built from a baseball player dataset.
4.4. Discussion on experiments

4.4.1. Comparison of the three methods
(A) EDP and EWP run much faster than MEP.
(B) For normal distribution datasets, MEP has the highest accuracy rate, followed by EDP and EWP; for uniform distribution datasets, the three methods perform roughly the same in accuracy. There is no significant difference in accuracy between normal distribution and uniform distribution datasets.
(C) From the experiments, we find that EDP is the most suitable method of the three, since it takes less runtime yet maintains good accuracy. If, however, accuracy is the primary concern, then MEP is the better choice.
4.4.2. Attribute selection

Two attribute sets need to be defined: the classifying attribute set and the clustering attribute set. The clustering attribute set reflects the user's interest in the dataset, so users may choose these attributes according to the subject of the clustering task. For the classifying attribute set, users should choose attributes that are relevant to those in the clustering attribute set; irrelevant attributes will not produce good results. Alternatively, since the algorithm selects the best-fitting attribute automatically, users may put all attributes into the classifying attribute set if runtime is not a concern.

4.4.3. Sensitivity of user-defined parameters

In the experiments, we set the parameters as follows: 0.001 ≤ a < 0.1, multiplied by 10 at each step; 10 ≤ b ≤ 1000, multiplied by 10 at each step; 0.01 ≤ c ≤ 0.1, increased by 0.01; 0.1 ≤ d ≤ 1, increased by 0.1; and 0.01 ≤ d ≤ 0.1, increased by 0.01. The sensitivity of the four user-defined parameters for each method is summarized as follows:

EWP (four parameters: a, b, c, and d)
• When d = 0.3, we find the best accuracy; when d > 0.5, the result is incorrect.
• The accuracy reaches its highest value when c = 0.06 or 0.1, and b = 10.
• b and c are independent of each other. When b and c are set to proper values, a has no effect on the result.
EDP (four parameters: a, b, c, and d)
• The accuracy is not sensitive to the values of a and d, i.e., their effect is not significant.
• There is no single value of c and b that yields the best accuracy. The accuracy varies only slightly as b changes from 10 to 100 to 1000, with differences always below 0.05. On the other hand, when c < 0.04 the accuracy remains below 0.5; therefore, the suggested value of c is no less than 0.04.

MEP (three parameters: a, b, and c)
• The accuracy is not sensitive to the value of a, i.e., its effect is not significant.
• When c > 0.09, the accuracy drops below 0.6, but when b is properly set, the accuracy may rise above 0.6. Therefore, we suggest first adjusting c until the accuracy reaches its best value, and then tuning b to obtain a better result.
5. Conclusion

Most existing clustering algorithms consider only one attribute set and cannot be applied when two attribute sets are involved. Our work relaxes this constraint so that the classifying attributes and the clustering attributes can be the same, partly different, or totally different; the two attribute sets are considered simultaneously in the clustering process. In this paper, we first define the classifying attributes and the clustering attributes, and then the characteristics of nodes and sub-nodes, including the number of records in a node, the space volume of a node, and the density of a node. Lastly, we define dense nodes and sparse nodes. With these definitions in place, we propose an algorithm with three variants, all of which are capable of clustering data with two different attribute sets. To evaluate the algorithm, we measure the efficiency and accuracy of the three variants, and we demonstrate its capability by applying it to a real data set.

The following are four possible extensions for the future:
1. Due to the two-attribute-set design, the accuracy obtained from the algorithm depends on the degree of dependency between the two attribute sets. Future research could explore how this dependency affects the accuracy.
2. In this paper, we perform sensitivity analysis only on synthetic data sets. In the future, we may conduct sensitivity analysis on a mixture of real-world data sets rather than artificially generated test data alone, or on a wider spectrum of data sets. Either way would give us a more thorough understanding of how the clustering results are influenced by data sets with different properties.
3. The three algorithms proposed in this paper all assume that the attributes are numerical. This is often not the case in real-life applications, where we may encounter other kinds of attributes such as categorical, Boolean or nominal ones. It would therefore be worth considering how to design new two-attribute-set clustering algorithms that can handle non-numerical attributes.
4. In this paper, we developed the clustering algorithms based on the density-based approach. Other approaches have also been used to develop clustering algorithms, including distance-based, partition-based, hierarchical-based, model-based and grid-based approaches. In the future, we might employ these other approaches to solve the problem.
Acknowledgement The research was supported in part by MOE Program for Promoting Academic Excellence of Universities under the Grant No. 91-H-FA07-1-4.
Appendix A. Synthetic data generation

We modify the data generation method developed in Liu et al. (2000) to produce the synthetic data. In this study, we use two different sets of attributes for classifying and clustering. The values of all dimensions lie in [0, 100]. To create a cluster, we first pick mc dimensions from the classifying attributes and nc dimensions from the clustering attributes, and then randomly choose two numbers in [0, 100] for each of these dimensions. These two numbers determine the range of that dimension in which the cluster exists (different clusters might share some dimensions and ranges, but they do not have the same ranges in all dimensions). For the remaining, unpicked dimensions, we assume that the values of a cluster spread uniformly over [0, 100]. Having defined the ranges for every cluster, we assume that the number of records in each cluster is (1 − e) × |X|/c. We then generate the data for each cluster within its ranges according to either a normal or a uniform distribution. Finally, we generate e × |X| noise records outside the cluster ranges. Table A1 lists all the parameters used in generating the synthetic data. There are in total 26 combinations of parameters in the experiments, as shown in Table A2.

Table A1
The parameters used in generating the synthetic data

|X|   The number of records
m     The number of classifying attributes
n     The number of clustering attributes
k     The number of common attributes
c     The number of clusters
mc    The number of classifying attributes in a cluster
nc    The number of clustering attributes in a cluster
e     The ratio of noise data (fixed at 1%)
dis   Data distribution
N     Normal distribution
U     Uniform distribution
Table A2
Combinations of parameters in the experiment

X100000–m8–n8–k8–c5–mc6–nc6–disN     X100000–m8–n8–k8–c5–mc6–nc6–disU
X200000–m8–n8–k8–c5–mc6–nc6–disN     X200000–m8–n8–k8–c5–mc6–nc6–disU
X300000–m8–n8–k8–c5–mc6–nc6–disN     X300000–m8–n8–k8–c5–mc6–nc6–disU
X400000–m8–n8–k8–c5–mc6–nc6–disN     X400000–m8–n8–k8–c5–mc6–nc6–disU
X500000–m8–n8–k8–c5–mc6–nc6–disN     X500000–m8–n8–k8–c5–mc6–nc6–disU
X100000–m8–n8–k8–c10–mc6–nc6–disN    X100000–m8–n8–k8–c10–mc6–nc6–disU
X100000–m8–n8–k8–c15–mc6–nc6–disN    X100000–m8–n8–k8–c15–mc6–nc6–disU
X100000–m8–n8–k8–c20–mc6–nc6–disN    X100000–m8–n8–k8–c20–mc6–nc6–disU
X100000–m8–n8–k8–c25–mc6–nc6–disN    X100000–m8–n8–k8–c25–mc6–nc6–disU
X100000–m8–n8–k6–c5–mc6–nc6–disN     X100000–m8–n8–k6–c5–mc6–nc6–disU
X100000–m8–n8–k4–c5–mc6–nc6–disN     X100000–m8–n8–k4–c5–mc6–nc6–disU
X100000–m8–n8–k2–c5–mc6–nc6–disN     X100000–m8–n8–k2–c5–mc6–nc6–disU
X100000–m8–n8–k0–c5–mc6–nc6–disN     X100000–m8–n8–k0–c5–mc6–nc6–disU
Let us use the data set X100000–m8–n8–k4–c5–mc6–nc6–disN for explanation: (1) X100000 indicates that the data set has 100,000 data records; (2) m8–n8–k4 means that we have eight classifying attributes and eight clustering attributes, among which four attributes are common, i.e., these four are used both as classifying and clustering attributes, so in total we have 8 + 8 − 4 = 12 attributes; (3) c5 means there are five clusters in total; (4) mc6–nc6–disN means the data of each cluster spread over specific ranges along six classifying attributes and six clustering attributes, in accordance with the normal distribution.
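Since the generation procedure is described only in prose, the following is a minimal sketch of it under stated assumptions: attribute names are our own, the normal case uses a clipped Gaussian centred on each cluster range, and noise records are drawn uniformly over the whole space (the paper places noise outside the cluster ranges, which this sketch only approximates). The function signature mirrors the parameters of Table A1.

```python
import random


def generate(n_records, m=8, n=8, k=8, c=5, mc=6, nc=6, e=0.01, dist="N", seed=0):
    """Sketch of the generator: m classifying and n clustering attributes with k
    in common, c clusters, mc/nc constrained dimensions per cluster, noise ratio e."""
    rng = random.Random(seed)
    attrs = [f"P{i}" for i in range(m + n - k)]          # m + n - k attributes in total
    classifying, clustering = attrs[:m], attrs[m - k:]   # the middle k attributes are shared
    clusters = []
    for _ in range(c):                                   # a [lo, hi] range per picked dimension
        picked = set(rng.sample(classifying, mc)) | set(rng.sample(clustering, nc))
        clusters.append({a: sorted(rng.uniform(0, 100) for _ in range(2)) for a in picked})
    data = []
    per_cluster = int((1 - e) * n_records / c)
    for ranges in clusters:
        for _ in range(per_cluster):
            rec = {}
            for a in attrs:
                lo, hi = ranges.get(a, (0.0, 100.0))     # unpicked dimensions span [0, 100]
                if dist == "N" and a in ranges:          # normal case: Gaussian centred on the range
                    rec[a] = min(hi, max(lo, rng.gauss((lo + hi) / 2, (hi - lo) / 6)))
                else:                                    # uniform within the range
                    rec[a] = rng.uniform(lo, hi)
            data.append(rec)
    while len(data) < n_records:                         # noise records (assumption: uniform everywhere)
        data.append({a: rng.uniform(0, 100) for a in attrs})
    return data
```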
References

Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1999. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 94–105.
Ankerst, M., Breunig, M., Kriegel, H.-P., Sander, J., 1999. OPTICS: Ordering points to identify clustering structure. In: Proceedings of the ACM SIGMOD Conference, Philadelphia, PA, pp. 49–60.
Basak, J., Krishnapuram, R., 2005. Interpretable hierarchical clustering by constructing an unsupervised decision tree. IEEE Transactions on Knowledge and Data Engineering 17 (1), 121–132.
Bezdek, J.C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
Chen, Y.L., Hsu, C.L., Chou, S.C., 2003. Constructing a multi-valued and multi-labeled decision tree. Expert Systems with Applications 25 (2), 199–209.
Cheng, C.H., Fu, A.W., Zhang, Y., 1999. Entropy-based subspace clustering for mining numerical data. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 84–93.
Friedman, J.H., Fisher, N.I., 1999. Bump hunting in high-dimensional data. Statistics and Computing 9 (2), 123–143.
Grabmeier, J., Rudolph, A., 2002. Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery 6 (4), 303–360.
Guha, S., Rastogi, R., Shim, K., 1998. CURE: An efficient clustering algorithm for large databases. In: Proceedings of the ACM SIGMOD Conference, Seattle, WA, pp. 73–84.
Han, J., Kamber, M., 2001. Data Mining: Concepts and Techniques. Academic Press.
Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: A review. ACM Computing Surveys 31 (3), 264–323.
Kantardzic, M., 2002. Data Mining: Concepts, Models, Methods, and Algorithms. Wiley-IEEE Press.
Karypis, G., Han, E.-H., Kumar, V., 1999. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Computer 32, 68–75.
Keim, D., Hinneburg, A., 1999. Clustering techniques for large data sets: From the past to the future. KDD Tutorial Notes 1999, pp. 141–181.
Klawonn, F., Kruse, R., 1997. Constructing a fuzzy controller from data. Fuzzy Sets and Systems 85, 177–193.
Liu, B., Xia, Y., Yu, P., 2000. Clustering through decision tree construction. In: Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, pp. 20–29.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Quinlan, J.R., 1996. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4, 77–90.
Ruggieri, S., 2002. Efficient C4.5. IEEE Transactions on Knowledge and Data Engineering 14 (2), 438–444.
Yao, Y.Y., 1998. A comparative study of fuzzy sets and rough sets. Journal of Information Sciences 109, 227–242.