Clustering categorical data sets using tabu search techniques

Pattern Recognition 35 (2002) 2783 – 2790

www.elsevier.com/locate/patcog

Michael K. Ng∗,1, Joyce C. Wong

Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong, China

Received 15 March 2001; received in revised form 10 September 2001; accepted 20 November 2001

∗ Corresponding author. Tel.: +852-2859-2252; fax: +852-2559-2225. E-mail address: [email protected] (M.K. Ng).
1 Research supported in part by Hong Kong Research Grants Council Grant Nos. HKU 7147/99P and 7132/00P, and 10203408, 10203501.

Abstract

Clustering methods partition a set of objects into clusters such that objects in the same cluster are more similar to each other than objects in different clusters, according to some defined criteria. The fuzzy k-means-type algorithm is well suited for implementing this clustering operation because of its effectiveness in clustering data sets. However, working only on numeric values limits its use, because data sets often contain categorical values. In this paper, we present a tabu search based clustering algorithm that extends the k-means paradigm to categorical domains, and to domains with both numeric and categorical values. Using tabu search based techniques, our algorithm can explore the solution space beyond local optimality in order to aim at finding a global solution of the fuzzy clustering problem. It is found that the clustering results produced by the proposed algorithm are very high in accuracy. © 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Clustering; k-means; k-modes; Tabu search; Numeric data; Categorical data

1. Introduction

Partitioning a set of objects into homogeneous clusters is a fundamental operation in data science. The operation is required in a number of data analysis tasks, such as unsupervised classification and data summation, as well as segmentation of large heterogeneous data sets into smaller homogeneous subsets that can be easily managed, separately modelled and analyzed. A set of objects described by a number of attributes is to be classified into several clusters such that each object is allowed to belong to more than one cluster with different degrees of association. This fuzzy clustering problem can be represented as the mathematical optimization problem

$$\min_{W,Z} \; F(W,Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}^{\alpha}\, d(z_l, x_i) \qquad (1)$$

subject to

$$0 \le w_{li} \le 1, \quad 1 \le l \le k, \; 1 \le i \le n, \qquad (2)$$

$$\sum_{l=1}^{k} w_{li} = 1, \quad 1 \le i \le n \qquad (3)$$

and

$$0 < \sum_{i=1}^{n} w_{li} < n, \quad 1 \le l \le k, \qquad (4)$$

where n is the number of objects, m is the number of attributes of each object, k (≤ n) is a known number of clusters, X = {x_1, x_2, …, x_n} is a set of n objects with m attributes, Z = [z_1, z_2, …, z_k] is an m × k matrix containing k cluster centers, W = [w_{li}] is a k × n fuzzy membership matrix, α > 1 is a weighting exponent, and d(z_l, x_i) (≥ 0) is a certain dissimilarity measure between the cluster center z_l and the object x_i. The above optimization problem was first formulated by Dunn [1]. A widely known approach to this problem is the fuzzy k-means algorithm, which was proposed by Ruspini [2] and Bezdek [3]. The fuzzy k-means algorithm is efficient in clustering data sets. It is initiated by selecting a value for W; the algorithm then iterates between computing the cluster centers Z given W and computing W given Z. The algorithm terminates when two successive values of W or Z are equal.
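For concreteness, the objective (1) and the constraints (2)–(4) can be evaluated directly. The following is a minimal sketch, not from the paper: the dissimilarity measure d is passed in as a callable, the function names are ours, and alpha denotes the weighting exponent of Eq. (1).

import numpy as np

def fuzzy_objective(W, Z, X, d, alpha=1.1):
    """Evaluate F(W, Z) = sum_l sum_i w_li^alpha * d(z_l, x_i) from Eq. (1)."""
    k, n = W.shape
    return sum(W[l, i] ** alpha * d(Z[l], X[i])
               for l in range(k) for i in range(n))

def satisfies_constraints(W):
    """Check constraints (2)-(4) on the fuzzy membership matrix W."""
    k, n = W.shape
    in_range = np.all((W >= 0) & (W <= 1))        # (2): 0 <= w_li <= 1
    cols_sum = np.allclose(W.sum(axis=0), 1.0)    # (3): each object's memberships sum to 1
    rows = W.sum(axis=1)
    nonempty = np.all((rows > 0) & (rows < n))    # (4): no cluster is empty or absorbs everything
    return bool(in_range and cols_sum and nonempty)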


It has been shown that the fuzzy k-means algorithm converges [4–7]. However, the algorithm may stop at only a local minimum of the optimization problem, because the function F(W, Z) is non-convex in general. To aim at finding a global solution of combinatorial optimization problems, tabu search based techniques, introduced by Glover [8], are applied. Tabu search based techniques are concerned with imposing restrictions to guide a search process through otherwise difficult regions. The search procedures do not terminate immediately at a locally optimal solution; instead they attempt to search beyond the local optimum, aiming at a global solution. Al-Sultan and Fedjki [9] proposed a tabu search based algorithm for the fuzzy clustering problem for numeric data sets. Their algorithm was found to outperform the fuzzy k-means algorithm considerably in their tests. However, the fuzzy k-means algorithm works only on numeric data, which limits its use in clustering applications where categorical data sets are frequently encountered. To deal with categorical data sets, Huang [10], and Huang and Ng [11], suggested the fuzzy k-modes algorithm. This algorithm extends the k-means algorithm by applying a simple matching dissimilarity measure for categorical objects and using modes instead of means for clusters. To cluster objects with mixed numeric and categorical attributes, Huang [10] also suggested the k-prototypes algorithm, which further integrates the k-means and k-modes algorithms by defining a combined dissimilarity measure.

The main aim of this paper is to develop a tabu search based clustering algorithm, extending the fuzzy k-modes and k-prototypes algorithms, to aim at finding a global solution of the fuzzy numeric and categorical data clustering problems. The outline of the paper is as follows. In Sections 2 and 3, the fuzzy k-modes and k-prototypes algorithms are briefly reviewed. In Section 4, tabu search based techniques are introduced. In Section 5, the new tabu search based clustering algorithm is proposed. In Section 6, experimental results are presented to illustrate the effectiveness of our new approach. In Section 7, some concluding remarks are given.

2. Fuzzy k-modes algorithm

The fuzzy k-modes algorithm is modified from the fuzzy k-means algorithm by using a simple matching dissimilarity measure for categorical data and replacing the means of clusters with modes. These modifications remove the numeric-only limitation of the fuzzy k-means algorithm while maintaining its efficiency in clustering categorical data sets. The simple matching dissimilarity measure between z_l and x_i, for l = 1, 2, …, k and i = 1, 2, …, n, is defined as

$$d_c(z_l, x_i) \equiv \sum_{j=1}^{m} \delta(z_{lj}, x_{ij}), \qquad (5)$$

Table 1. Fuzzy k-modes algorithm

Step 1: Choose an initial point Z^(1) ∈ R^{km}. Determine W^(1) such that F(W, Z^(1)) is minimized. Set t = 1.
Step 2: Determine Z^(t+1) such that F(W^(t), Z^(t+1)) is minimized. If F(W^(t), Z^(t+1)) = F(W^(t), Z^(t)), then stop; otherwise go to Step 3.
Step 3: Determine W^(t+1) such that F(W^(t+1), Z^(t+1)) is minimized. If F(W^(t+1), Z^(t+1)) = F(W^(t), Z^(t+1)), then stop; otherwise set t = t + 1 and go to Step 2.

where z_l = [z_{l1}, …, z_{lm}]^T, x_i = [x_{i1}, …, x_{im}]^T and

$$\delta(z_{lj}, x_{ij}) = \begin{cases} 0 & \text{if } z_{lj} = x_{ij}, \\ 1 & \text{if } z_{lj} \ne x_{ij}. \end{cases} \qquad (6)$$

The optimization problem for partitioning a set of n objects described by m categorical attributes into k clusters becomes

$$\min_{W,Z} \; F(W,Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}^{\alpha}\, d_c(z_l, x_i) \qquad (7)$$

subject to the same constraints (2)–(4). Minimization of F in Eq. (7) with the constraints in Eqs. (2)–(4) forms a class of constrained nonlinear optimization problems whose solution is unknown. The usual method towards optimization of F in Eq. (7) is to use partial optimization for Z and W. In this method, we first fix Z and find necessary conditions on W to minimize F. Then we fix W and minimize F with respect to Z. This process is formalized in the fuzzy k-modes algorithm in Table 1. The matrices W and Z are computed as follows. Let Z be fixed, i.e., z_l for l = 1, 2, …, k are given. Then W is given by

$$w_{li} = \begin{cases} 1 & \text{if } x_i = z_l, \\ 0 & \text{if } x_i = z_h \text{ but } h \ne l, \\ 1 \Big/ \sum_{h=1}^{k} \left[ \dfrac{d_c(z_l, x_i)}{d_c(z_h, x_i)} \right]^{1/(\alpha-1)} & \text{if } x_i \ne z_l \text{ and } x_i \ne z_h, \; 1 \le h \le k, \end{cases} \qquad (8)$$

for 1 ≤ l ≤ k, 1 ≤ i ≤ n. Let W be fixed. Then Z can be found by the k-modes update method. Each categorical object is described by m categorical attributes, and the jth attribute has n_j categories a_j^{(1)}, a_j^{(2)}, …, a_j^{(n_j)} for 1 ≤ j ≤ m. Let the lth cluster center be z_l = [z_{l1}, z_{l2}, …, z_{lm}]^T. Then F(W, Z) is minimized if and only if

$$z_{lj} = a_j^{(r)} \quad \text{where} \quad \sum_{i:\, x_{ij} = a_j^{(r)}} w_{li}^{\alpha} \;\ge\; \sum_{i:\, x_{ij} = a_j^{(t)}} w_{li}^{\alpha}, \quad 1 \le t \le n_j. \qquad (9)$$
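The two partial-optimization steps of Eqs. (8) and (9) can be sketched as follows. This is our illustrative Python rendering, with hypothetical function names; objects and modes are assumed to be NumPy arrays of categorical codes, and an object coinciding with several modes has its membership split evenly among them, a tie case Eq. (8) leaves implicit.

import numpy as np

def matching_dissimilarity(z, x):
    """Simple matching dissimilarity d_c of Eq. (5): the number of
    attributes on which z and x disagree."""
    return int(np.sum(z != x))

def update_W(Z, X, alpha):
    """Eq. (8): update the fuzzy membership matrix W with the modes Z fixed."""
    k, n = len(Z), len(X)
    W = np.zeros((k, n))
    for i in range(n):
        d = np.array([matching_dissimilarity(Z[l], X[i]) for l in range(k)],
                     dtype=float)
        if np.any(d == 0):                 # x_i coincides with some mode(s)
            W[d == 0, i] = 1.0 / np.count_nonzero(d == 0)
        else:
            for l in range(k):
                W[l, i] = 1.0 / np.sum((d[l] / d) ** (1.0 / (alpha - 1.0)))
    return W

def update_Z(W, X, alpha):
    """Eq. (9): update the modes with W fixed.  For each cluster l and
    attribute j, choose the category with the largest sum of w_li^alpha."""
    k = W.shape[0]
    m = X.shape[1]
    Z = np.empty((k, m), dtype=X.dtype)
    for l in range(k):
        for j in range(m):
            cats = np.unique(X[:, j])
            weights = [np.sum(W[l, X[:, j] == c] ** alpha) for c in cats]
            Z[l, j] = cats[int(np.argmax(weights))]
    return Z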


Table 2. Tabu search method

Step 1: Initialization.
1. Select an initial configuration x_now.
2. Record the current best solution by setting x_best = x_now. Evaluate its objective function value and label it best cost.
Step 2: Choice and termination. Using x_now, generate a candidate subset N(x_now) and evaluate the corresponding objective function values. If the best candidate in N(x_now), in terms of objective function value, is not tabu, or if it is tabu but satisfies the aspiration criteria, then pick it as x_next. Otherwise pick the first non-tabu move as x_next. Terminate by a chosen iteration cut-off rule (e.g., a limit on the total number of iterations).
Step 3: Update. Put x_next into the tabu list and reset x_now = x_next. If the objective function value of x_now is less than best cost, then update the best solution as in Step 1.2. Return to Step 2.

However, the fuzzy k-modes algorithm may stop at only a locally optimal solution of the clustering problem, which means that the solution obtained can still be improved. Therefore, tabu search based techniques are incorporated to aim at finding a global solution of the optimization problem (7).

3. Fuzzy k-prototypes algorithm

The k-prototypes algorithm integrates the k-means and k-modes algorithms to cluster data sets with mixed numeric and categorical values. In this algorithm, a combined dissimilarity measure is defined to deal with both numeric and categorical attributes. The clustering process is similar to the k-means algorithm, except that the k-modes approach is used to update the categorical attribute values of the cluster prototypes, so it preserves the efficiency of the k-means algorithm.

Denote by [X^{(n)} | X^{(c)}]^T = [x_1, x_2, …, x_n] a set of n objects with m mixed numeric and categorical attributes, in which X^{(n)} represents the numeric attributes and X^{(c)} represents the categorical attributes; and by Z = [Z^{(n)} | Z^{(c)}]^T = [z_1, z_2, …, z_k] an m × k matrix containing k cluster centers, in which Z^{(n)} represents the numeric attributes and Z^{(c)} represents the categorical attributes. The combined dissimilarity measure between two objects x_l = [x_l^{(n)} | x_l^{(c)}]^T and x_i = [x_i^{(n)} | x_i^{(c)}]^T with mixed numeric and categorical attributes is defined as

$$d_n(x_l^{(n)}, x_i^{(n)}) + \beta\, d_c(x_l^{(c)}, x_i^{(c)}), \qquad (10)$$

where d_n(·,·) is the Euclidean distance usually used in the k-means algorithm, d_c(·,·) is the simple matching dissimilarity measure used in the k-modes algorithm, defined in Eq. (5), and β is a weight that balances the numeric and categorical parts to avoid favoring either type of attribute. The optimization problem for fuzzy clustering of a set of n objects described by m mixed numeric and categorical

attributes into k clusters is

$$\min_{W,Z} \; F(W,Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}^{\alpha} \left[ d_n(z_l^{(n)}, x_i^{(n)}) + \beta\, d_c(z_l^{(c)}, x_i^{(c)}) \right] \qquad (11)$$

subject to the same constraints (2)–(4). The matrices W and Z can be computed similarly to the fuzzy k-modes algorithm. In particular, W can be updated using the same formula (8), except that d_c(·,·) is replaced by the combined dissimilarity measure defined in Eq. (10); Z^{(c)} can be obtained by Eq. (9); and Z^{(n)} can be computed as

$$z_{lj} = \frac{\sum_{i=1}^{n} w_{li}^{\alpha}\, x_{ij}}{\sum_{i=1}^{n} w_{li}^{\alpha}}, \quad 1 \le l \le k, \; 1 \le j \le m.$$
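A small sketch of the combined measure (10) and the numeric center update follows; the code and names are ours. We take d_n to be the squared Euclidean distance commonly paired with k-means, though the plain Euclidean distance the text mentions would fit equally well, and beta is the balancing weight.

import numpy as np

def combined_dissimilarity(z_num, z_cat, x_num, x_cat, beta):
    """Combined measure of Eq. (10): a numeric part plus beta times the
    simple matching part on the categorical attributes."""
    d_n = float(np.sum((np.asarray(z_num) - np.asarray(x_num)) ** 2))
    d_c = int(np.sum(np.asarray(z_cat) != np.asarray(x_cat)))
    return d_n + beta * d_c

def update_numeric_centers(W, X_num, alpha):
    """Fuzzy numeric center update:
    z_lj = sum_i w_li^alpha x_ij / sum_i w_li^alpha."""
    Wa = W ** alpha                               # (k, n) weighted memberships
    return (Wa @ X_num) / Wa.sum(axis=1, keepdims=True)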

4. Tabu search based techniques

The tabu search method in Table 2 is based on procedures designed to cross boundaries of feasibility or local optimality, which are usually treated as barriers, and systematically to impose and release constraints to permit the exploration of otherwise forbidden regions. Tabu search is a meta-heuristic that guides a local heuristic search procedure to explore the solution space beyond local optimality. A fundamental element underlying tabu search is the use of flexible memory. A chief mechanism for exploiting memory in tabu search is to classify a subset of the moves in a neighborhood as forbidden, or tabu. The basic elements of the tabu search method are defined as follows (a generic sketch is given after this list):

1. Configuration is an assignment of values to variables; it is a solution to the optimization problem.
2. Move is a specific procedure for obtaining a trial solution, feasible for the optimization problem, that is related to the current configuration.
3. Neighborhood is the set of all neighbors, which are the "adjacent solutions" that can be reached from any current


Table 3. Tabu search based categorical clustering algorithm

Step 1: Initialization. Let Z^u be arbitrary centers and F^u the corresponding objective function value. Let Z^b = Z^u and F^b = F^u. Select values for NTLM (tabu list size), P (probability threshold), NH (number of trial solutions), IMAX (the maximum number of iterations for each center) and γ (the iteration reducer). Let h = 1, NTL = 0 and r = 1. Go to Step 2.
Step 2: Using Z^u, fix all centers and move center z_r^u by generating NH neighbors z_1^t, z_2^t, …, z_NH^t, and evaluate their corresponding objective function values F_1^t, F_2^t, …, F_NH^t. Go to Step 3.
Step 3: (a) Sort F_i^t, i = 1, …, NH, in non-decreasing order and denote them as F_[1]^t, …, F_[NH]^t; clearly F_[1]^t ≤ … ≤ F_[NH]^t. Let e = 1. If F_[1]^t ≥ F^b, then replace h by h + 1. Go to Step 3(b).
(b) If z_[e]^t is not tabu, or if it is tabu but F_[e]^t < F^b, then let z_r^u = z_[e]^t and F^u = F_[e]^t and go to Step 4. Otherwise generate u ∼ U(0, 1), where U(0, 1) is the uniform distribution between 0 and 1. If F^b < F_[e]^t < F^u and u ≥ P, then let z_r^u = z_[e]^t and F^u = F_[e]^t and go to Step 4; otherwise go to Step 3(c).
(c) Check the next neighbor by letting e = e + 1. If e ≤ NH, go to Step 3(b). Otherwise go to Step 3(d).
(d) If h ≥ IMAX, then go to Step 5. Otherwise select a new set of neighbors by going to Step 2.

Step 4: Insert z_r^u at the bottom of the tabu list. If NTL = NTLM, then delete the top entry of the tabu list; otherwise let NTL = NTL + 1. If F^b ≥ F^u, then let F^b = F^u and Z^b = Z^u. Go to Step 3(d).
Step 5: If r < k, then let r = r + 1, reset h = 1 and go to Step 2. Otherwise set IMAX = γ · IMAX. If IMAX ≥ 1, then let r = 1, reset h = 1 and go to Step 2; otherwise stop. (Z^b contains the best centers and F^b is the corresponding best objective function value.)

configuration. It may also include neighbors that do not satisfy the customary feasibility conditions.
4. Candidate subset is a subset of the neighborhood. It is examined instead of the entire neighborhood, especially for large problems where the neighborhood has many elements.
5. Tabu restrictions are constraints that prevent chosen moves from being reversed or repeated. They play a memory role for the search by marking the forbidden moves as tabu. The tabu moves are stored in a list, called the tabu list.
6. Aspiration criteria are rules that determine when the tabu restrictions can be overridden, thus removing a tabu classification otherwise applied to a move. If a certain move is forbidden by some tabu restriction, then the aspiration criteria, when satisfied, can make this move allowable.
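The control flow of Table 2 can be condensed into a short generic routine. The sketch below is ours, not the paper's implementation: whole solutions (assumed hashable, e.g. tuples) serve as tabu attributes, the aspiration criterion is the common "better than the best so far" rule, and sample_neighbors and cost are assumed callables.

def tabu_search(initial, sample_neighbors, cost, tabu_size=100, max_iters=1000):
    """Generic tabu search following the outline of Table 2.

    sample_neighbors(x) returns a candidate subset N(x) of the
    neighborhood of x, and cost(x) evaluates the objective."""
    now = best = initial
    best_cost = cost(best)
    tabu = []                                      # the tabu list
    for _ in range(max_iters):                     # iteration cut-off rule
        candidates = sorted(sample_neighbors(now), key=cost)
        nxt = None
        for cand in candidates:
            # Aspiration criterion: a tabu move is allowed if it improves
            # on the best solution found so far.
            if cand not in tabu or cost(cand) < best_cost:
                nxt = cand
                break
        if nxt is None:                            # every candidate is tabu
            break
        tabu.append(nxt)                           # record the move as tabu
        if len(tabu) > tabu_size:
            tabu.pop(0)                            # forget the oldest move
        now = nxt
        if cost(now) < best_cost:                  # update the best solution
            best, best_cost = now, cost(now)
    return best, best_cost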

5. Categorical data clustering

Our new algorithm, given in Table 3, combines the fuzzy k-modes algorithm with tabu search based techniques in order to find a global solution of the clustering problem for categorical data. In our algorithm, Eq. (8) is used to update the fuzzy partition matrix W. However, we do not use Eq. (9) to update the cluster centers Z. Instead, Z is generated by the method described below, and each generated Z is mapped to an objective function value. This technique has been used by Al-Sultan and Fedjki [9]. Let Z^t, Z^u and Z^b denote the trial, current and best cluster centers, and let F^t, F^u and F^b denote the corresponding trial, current

and best objective function values, respectively. A number of trial cluster centers Z^t are generated through moves from the current cluster centers Z^u. As the algorithm proceeds, the best cluster centers found so far are saved in Z^b. The corresponding objective function values F^t, F^u and F^b are updated accordingly. Table 3 also involves several parameters, described below:

1. NTLM (tabu list size): the tabu list contains the history of the search, and NTLM is the maximum number of moves to be stored in the list. The larger (smaller, respectively) the value of NTLM, the stronger (weaker, respectively) the memory of the search, and hence the more the search emphasizes diversification (intensification, respectively).
2. P (probability threshold): used to allow moves that are tabu but better than the current solution to be examined, because this may lead to a better solution.
3. NH (number of trial solutions): the number of trial solutions generated for each center. The larger (smaller, respectively) the value of NH, the more (fewer, respectively) neighbors are examined, and hence the more the search emphasizes diversification (intensification, respectively).
4. IMAX (maximum number of non-improving moves for each center): decides how many non-improving moves are allowed for each center before going to the next one. It is observed that when the search gets close to the solution, the time needed to examine a given center is reduced. Therefore, IMAX is treated as a variable parameter instead of a fixed number.
5. γ (reduction factor for IMAX): if IMAX non-improving moves are performed, then the next center is considered.


When all centers have been considered, IMAX is reduced by a factor γ, where 0 < γ < 1, until it goes below 1, which corresponds to the stopping criterion. The smaller the value of γ, the faster IMAX goes below 1 and hence the fewer passes through the centers the search makes, but this could be at the expense of solution quality.

5.1. Generation of neighborhoods

One of the most distinctive features of tabu search is the generation of neighborhoods. Since numeric data have a natural ordering, the neighborhood of the center z^u is defined as

$$N(z^u) = \{ y = [y_1, y_2, \ldots, y_m]^T \mid y_i = z_i^u + \delta d, \; i = 1, 2, \ldots, m \text{ and } d = 0, -1 \text{ or } +1 \}. \qquad (12)$$

We note that when z^u is close to the solution, a small step size δ can be used. The neighbors of z^u can be generated by picking randomly from N(z^u).

There are two kinds of categorical attributes, namely ordinal and nominal. Ordinal attributes have ordered levels, such as size and education level; their neighborhoods can be defined as in Eq. (12) for numeric data sets. However, this approach cannot be applied to categorical data sets with nominal attributes, since these have no natural ordering. In this paper, we propose to use the "distance" concept to make moves from the cluster center for categorical data sets. The neighborhood of z^u is defined as

$$N(z^u) = \{ y = [y_1, y_2, \ldots, y_m]^T \mid d_c(y, z^u) \le d \} \qquad (13)$$

for some positive integer d. In our algorithm, we generate a set of neighbors at a certain distance d from the center, i.e., neighbors that differ from the center in d attributes. We remark that the distance d can be seen as the number of attributes changed to generate a neighbor, which is the criterion for selecting the neighborhood. These d attributes are randomly chosen among the m given attributes, and their category values are changed, where 0 ≤ d ≤ m. The greater (smaller, respectively) the value of d, the larger (smaller, respectively) the solution space to be examined, and hence the more the search emphasizes diversification (intensification, respectively).
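The two neighborhood definitions (12) and (13) can be realized as follows; this is an illustrative sketch with our own names. For a nominal center, exactly d randomly chosen attributes are switched to different category values, matching the move described above.

import numpy as np

def numeric_neighbor(z, delta, rng):
    """One neighbor from N(z) in Eq. (12): every coordinate moves by
    delta times a direction drawn from {-1, 0, +1}."""
    steps = rng.integers(-1, 2, size=len(z))
    return np.asarray(z) + delta * steps

def nominal_neighbor(z, categories, d, rng):
    """One neighbor from N(z) in Eq. (13): exactly d randomly chosen
    attributes are switched to different category values.
    categories[j] lists the admissible categories of attribute j."""
    y = np.array(z, copy=True)
    for j in rng.choice(len(z), size=d, replace=False):
        others = [c for c in categories[j] if c != z[j]]
        y[j] = rng.choice(others)
    return y

# Example: five trial neighbors of a categorical center, each differing
# from it in exactly two attributes.
rng = np.random.default_rng(0)
categories = [["a", "b", "c"], ["x", "y"], ["p", "q", "r"], ["m", "n"]]
z = np.array(["a", "x", "p", "m"])
trials = [nominal_neighbor(z, categories, d=2, rng=rng) for _ in range(5)]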

6. Experimental results

The tabu search based clustering algorithm is coded in the C++ programming language. Several data sets [10] are used to evaluate the performance of the algorithm.


6.1. Data sets

The first data set is the soybean disease data set [10]. We chose this data set to test the algorithm because all attributes of the data set can be treated as categorical. The soybean data set has 47 records, each described by 35 attributes. Each record is labelled as one of 4 diseases: diaporthe stem canker, charcoal rot, rhizoctonia root rot and phytophthora rot. Except for phytophthora rot, which has 17 records, each disease has 10 records. Of the 35 attributes we selected only 21, because the other 14 have only one category.

The second data set is the credit approval data set [10]. This data set has 690 instances, each described by 6 numeric and 9 categorical attributes. The instances are classified into 2 classes: approved, labelled "+", and rejected, labelled "−". Thirty-seven instances have missing values in 7 attributes. Since missing values in numeric attributes cannot be handled, the 24 instances with missing values in numeric attributes are removed, so only 666 instances are considered.

6.2. Clustering accuracy

We obtain the cluster memberships from the fuzzy matrix W as follows. The record x_i, for i = 1, 2, …, n, is assigned to the lth cluster if

$$w_{li} = \max_{1 \le h \le k} w_{hi}. \qquad (14)$$

If the maximum is not unique, then x_i is assigned to the cluster first achieving the maximum. A clustering result is measured by the clustering accuracy r, defined as

$$r = \frac{\sum_{l=1}^{k} r_l}{n}, \qquad (15)$$

where r_l is the number of objects partitioned into the correct cluster l and n is the total number of objects in the data set.

6.3. Results

We employ the fuzzy k-modes and tabu search based k-modes clustering algorithms to cluster the soybean disease data set into 4 clusters. The initial modes are arbitrarily chosen as the first k records of the data set. For both algorithms, we specify α = 1.1, as suggested in the paper [11]. Each algorithm is run 100 times. In the following tests, we select γ = 0.75, P = 0.97 and IMAX = 100.

In Table 4, it is found that the clustering accuracy of the tabu search based k-modes algorithm is very high. The average accuracy exceeds 99%, and the number of runs in which all records are correctly clustered into the 4 given clusters is 67 out of 100. In contrast, the average clustering accuracy of the fuzzy k-modes algorithm is only about 80%, and the number of runs in which all records are correctly clustered into the 4 given clusters is 20 out of 100. It is clear that the tabu search based k-modes algorithm produces a more accurate clustering result.
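Eqs. (14) and (15) above translate directly into code. The sketch below is ours; since Eq. (15) presumes that each cluster is matched to a known class, we credit each cluster with its most frequent true label to obtain the counts r_l, a common convention.

import numpy as np

def assign_clusters(W):
    """Eq. (14): assign object i to the cluster with the largest membership.
    np.argmax returns the first cluster achieving the maximum, matching the
    tie-breaking rule above."""
    return np.argmax(W, axis=0)

def clustering_accuracy(assigned, true_labels, k):
    """Eq. (15): r = (sum_l r_l) / n, where r_l is taken as the count of
    the most frequent true label within cluster l."""
    true_labels = np.asarray(true_labels)
    n = len(true_labels)
    correct = 0
    for l in range(k):
        members = true_labels[assigned == l]
        if members.size > 0:
            _, counts = np.unique(members, return_counts=True)
            correct += int(counts.max())          # r_l for cluster l
    return correct / n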


Table 4. Clustering accuracy of 100 runs when d = 3, NTLM = 100 and NH = 50

Algorithm                      Average accuracy    Number of runs with r = 1
Tabu search based k-modes      0.991               67
Fuzzy k-modes                  0.790               20

Table 5. Clustering results of the tabu search based k-modes algorithm

NTLM   NH    d    Accuracy    Number of runs with r = 1
50     20    3    0.989       65
50     20    5    0.980       64
50     20    7    0.959       45
50     50    3    0.986       65
50     50    5    0.987       64
50     50    7    0.982       53
50     100   3    0.990       68
50     100   5    0.987       62
50     100   7    0.981       52
100    20    3    0.985       51
100    20    5    0.983       60
100    20    7    0.953       44
100    50    3    0.991       67
100    50    5    0.988       62
100    50    7    0.971       54
100    100   3    0.988       61
100    100   5    0.987       58
100    100   7    0.981       54
200    20    3    0.989       67
200    20    5    0.983       64
200    20    7    0.962       39
200    50    3    0.990       64
200    50    5    0.988       65
200    50    7    0.979       58
200    100   3    0.988       62
200    100   5    0.985       53
200    100   7    0.982       64

Next we test different sets of parameters for the tabu search based k-modes clustering algorithm. For each set of parameters (NTLM, NH, d), the algorithm is run 100 times. The results are listed in Table 5. We see from the table that the average clustering accuracy of the tabu search based k-modes algorithm is above 95%, and the number of runs in which all records are correctly clustered into the 4 given clusters is at least 39 out of 100. Again, the tabu search based k-modes algorithm is better than the fuzzy k-modes algorithm

[Fig. 1. Average clustering accuracy and average objective function values. Average accuracy (0.95 to 1.00) is plotted against the average optimal objective function value (195 to 245).]

in all cases.

From Table 5, we also have the following observations:

1. When NTLM and NH are fixed, the clustering accuracy decreases as d increases. When d increases, the number of objects in the neighborhood of the current solution increases; since NH is fixed, it becomes harder for the tabu search to explore the solution space and find a better solution.
2. When NH and d are fixed, the clustering accuracy is about the same even when NTLM changes. Similarly, when NTLM and d are fixed, the clustering accuracy is about the same even when NH changes. Since the storage cost of the tabu search based k-modes algorithm depends on the sizes of NTLM and NH, this phenomenon can reduce the memory requirement of the proposed tabu search based algorithm.

We also study the relationship between the objective function value of the optimization problem (7) and the clustering accuracy. Fig. 1 shows the average clustering results in Table 5 and their corresponding average objective function values. We see that the average objective function values associated with high clustering accuracy are smaller than those associated with low clustering accuracy. This experimental relationship indicates that we can use the objective function values to choose a good clustering result when the original classification of the data is unknown.

Next we test the tabu search based k-prototypes algorithm by partitioning the credit approval data set into 2 clusters with different β values. Here β is the weight that balances the numeric and categorical parts to avoid favoring either type of attribute. In the clustering process, all numeric attributes in the data set are rescaled to the range [0, 1], as suggested in the paper [10]. The initial cluster centers are k arbitrarily chosen records from the data set. For each β value, the algorithm is run 100 times. In this test, we set NTLM = 100, NH = 100 and d = 1.
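The relationship observed in Fig. 1 suggests a simple selection rule when class labels are unavailable: run the algorithm several times and keep the solution with the smallest objective value. A minimal sketch, assuming run_algorithm() is a hypothetical callable returning a (solution, objective) pair:

def best_of_runs(run_algorithm, n_runs=100):
    """Run the clustering n_runs times and keep the result with the
    smallest objective value F, following the relationship in Fig. 1."""
    results = [run_algorithm() for _ in range(n_runs)]
    return min(results, key=lambda pair: pair[1])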


Table 6. Clustering results of the tabu search based clustering algorithm for different values of β. For each β, the algorithm is run 100 times and the attained accuracies range from 0.71 to 0.84; the last row gives the average accuracy.

β                   0.5     0.7     0.9     1.0     1.1     1.2     1.3     1.4
Average accuracy    0.792   0.795   0.794   0.793   0.794   0.795   0.796   0.795

Table 7. Average CPU time required in seconds for different clustering methods

Data set           Fuzzy k-modes        Tabu search based fuzzy k-modes
Soybean disease    0.37                 36.25

Data set           Fuzzy k-prototypes   Tabu search based fuzzy k-prototypes
Credit approval    6.35                 659.64

[Fig. 2. Average clustering accuracy for different values of β, comparing the tabu search based clustering algorithm with the k-prototypes algorithm. Accuracy (0.71 to 0.80) is plotted against β (0.5 to 1.4).]

In Table 6, we show the classification accuracy results. In all cases, the clustering accuracy is higher than 70%; on average, the clustering accuracy is about 80%. We remark that the average clustering accuracy of the k-prototypes algorithm is only about 77%. Again, the tabu search based k-prototypes algorithm produces a more accurate clustering result. We also see from the table that the values of β do not affect the clustering accuracy of the tabu search based algorithm; see Fig. 2. In contrast, when we use the k-prototypes algorithm to cluster the data set, we find that the accuracy is lower when β is small and higher when β is large; see Fig. 2. Based on these observations, one might conclude that a small β indicates that the clustering favored the numeric attributes, that

the two classes of instances cannot be well separated using just the numeric attributes, and that the categorical attributes dominate the clustering. However, this is not a correct interpretation, since the results of the tabu search based k-prototypes algorithm show that there are no significant differences in accuracy for different β values. Because the k-prototypes algorithm always finds locally optimal solutions, the clustering accuracy may be affected by using these local optima. Using tabu search based techniques, our algorithm can explore the solution space beyond local optimality in order to aim at finding a global solution of the fuzzy clustering problem, and the clustering results produced by the proposed algorithm can be high in accuracy.

Finally, Table 7 gives the average CPU time required by the fuzzy k-modes and fuzzy k-prototypes algorithms on a 1 GHz PC. The times used by the fuzzy k-modes and fuzzy k-prototypes algorithms were significantly less than those used by the tabu search based fuzzy k-modes and fuzzy k-prototypes algorithms. However, we remark that even though the fuzzy k-modes and the


fuzzy k-prototypes algorithms can process the data sets efficiently, they cannot give the high clustering accuracies obtained by using the tabu search based fuzzy k-modes and fuzzy k-prototypes algorithms; see Fig. 2 and Table 4. To improve the efficiency of the tabu search based fuzzy k-modes and fuzzy k-prototypes algorithms, parallel versions can be developed. In the literature [12–14], it is empirically shown that parallelization of the sequential tabu search algorithm does not reduce solution quality while providing substantial speedups in practice.

7. Concluding remarks

We have introduced the tabu search based fuzzy k-modes algorithm for clustering categorical objects. The most important result of this work is the procedure in Section 5 that allows the tabu search paradigm to be used for clustering categorical data. This procedure removes the numeric-only limitation of the tabu search based fuzzy k-means algorithm. The experimental results have shown that the tabu search based k-modes-type algorithms are effective in recovering the inherent clustering structures from categorical data if such structures exist.

References

[1] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybernet. 3 (3) (1974) 32–57.
[2] E.R. Ruspini, A new approach to clustering, Inform. Control 19 (1969) 22–32.
[3] J.C. Bezdek, Fuzzy mathematics in pattern classification, Ph.D. Dissertation, Department of Applied Mathematics, Cornell University, Ithaca, New York, 1973.

[4] J.C. Bezdek, A convergence theorem for the fuzzy ISODATA clustering algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 2 (1980) 1–8.
[5] R.J. Hathaway, J.C. Bezdek, Local convergence of the fuzzy c-means algorithms, Pattern Recognition 19 (6) (1986) 477–480.
[6] S.Z. Selim, M.A. Ismail, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1) (1984) 81–87.
[7] S.Z. Selim, M.A. Ismail, Fuzzy C-means: optimality of solutions and effective termination of the algorithm, Pattern Recognition 19 (6) (1986) 651–663.
[8] F. Glover, M. Laguna, Tabu Search, Kluwer Academic Publishers, Boston, 1997.
[9] K.S. Al-Sultan, C.A. Fedjki, A tabu search-based algorithm for the fuzzy clustering problem, Pattern Recognition 30 (12) (1997) 2023–2030.
[10] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining Knowledge Discovery 2 (3) (1998) 283–304.
[11] Z. Huang, M.K. Ng, A fuzzy k-modes algorithm for clustering categorical data, IEEE Trans. Fuzzy Systems 7 (4) (1999) 446–452.
[12] B. Garcia, J. Potvin, J. Rousseau, A parallel implementation of the tabu search heuristic for vehicle routing problems with time window constraints, Comput. Oper. Res. 21 (1994) 1025–1033.
[13] M. Malek, M. Guruswamy, M. Pandya, H. Owens, Serial and parallel simulated annealing and tabu search algorithms for the traveling salesman problem, Ann. Oper. Res. 21 (1989) 59–84.
[14] J. Chakrapani, J. Skorin-Kapov, Massively parallel tabu search for the quadratic assignment problem, Ann. Oper. Res. 41 (1993) 327–341.

About the Author—MICHAEL K. NG was born in Hong Kong, China, in 1967. He received B.Sc. and M.Phil. degrees in Mathematics from the University of Hong Kong, in 1990 and 1993, respectively, and a Ph.D. degree in Mathematics from the Chinese University of Hong Kong, in 1995. From 1995 to 1997 he was a Research Fellow at the Australian National University. He is currently an Assistant Professor in the Department of Mathematics at the University of Hong Kong. Ng's research interests are in the areas of data mining, operations research and scientific computing. He was selected as one of the recipients of the Outstanding Young Researcher Award of the University of Hong Kong in 2001.

About the Author—JOYCE WONG received a B.Sc. degree in Mathematics from the University of Hong Kong in 1999.