Scale-invariant clustering with minimum volume ellipsoids

Mahesh Kumar (a), James B. Orlin (b)

a Rutgers Business School, Rutgers University, 180 University Avenue, Newark, NJ 07102, USA
b Operations Research Center, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Bldg. E40-149, Cambridge, MA 02139, USA

Computers & Operations Research 35 (2008) 1017-1029

Available online 22 August 2006

Abstract

This paper develops theory and algorithms concerning a new metric for clustering data. The metric minimizes the total volume of clusters, where the volume of a cluster is defined as the volume of the minimum volume ellipsoid (MVE) enclosing all data points in the cluster. This metric is scale-invariant, that is, the optimal clusters are invariant under an affine transformation of the data space. We introduce the concept of outliers in the new metric and show that the proposed method of treating outliers asymptotically recovers the data distribution when the data comes from a single multivariate Gaussian distribution. Two heuristic algorithms are presented that attempt to optimize the new metric. On a series of empirical studies with Gaussian distributed simulated data, we show that volume-based clustering outperforms well-known clustering methods such as k-means, Ward's method, SOM, and model-based clustering.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Minimum volume ellipsoid; Outliers; Scale-invariant clustering; Robust clustering

1. Introduction

Clustering is a fundamental technique that divides data into groups (clusters) for the purpose of summarization or improved understanding. Clustering has been widely studied for several decades across multiple disciplines, including statistics, operations research, data mining, and machine learning, and across multiple application areas, including taxonomy, medicine, astronomy, marketing, finance, and e-commerce.

The problem of clustering is defined as follows: given n data points, x_1, ..., x_n, in a d-dimensional metric space, partition the data into K ≤ n clusters, C_1, ..., C_K, such that data points within a cluster are more similar to each other than data points in different clusters. Most clustering methods form clusters based on proximity between the data points in the x space. A commonly used measure of proximity between a pair of data points is the squared Euclidean distance between the points. A popular criterion for clustering is to minimize the sum of within-cluster squared Euclidean distances, given by

    \min_{C_1,\dots,C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - c_k\|^2,    (1)

where c_k is the mean of the data points in cluster C_k. Several clustering methods, including k-means [1] and Ward's hierarchical clustering [2], optimize this criterion.


Fig. 1. Clusters obtained using k-means on height and weight measurements. (Axis labels in the original figure: height in centimeters / inches, weight in pounds / kilograms.)

A drawback of these and other traditional clustering methods is that the clustering results are sensitive to the units of measurement, that is, changing the measurement units may lead to a different clustering structure, as shown in Fig. 1. A commonly used approach to fix this problem is to normalize the data to unit variance on each variable before clustering. Normalization converts the original variables into unitless variables, but it can produce ad hoc clusters, as suggested in [3]. This motivates the development of a new scale-invariant clustering method that is invariant under an affine transformation of the data space.

Another drawback of a large number of clustering methods is that they do not account for outliers. Among methods that do account for outliers, most either cluster first and then identify the outliers, or identify the outliers first and then cluster the remaining data [4-6]. This approach is circular: if we knew the correct clusters, we could identify the correct outliers, and if we knew the correct outliers, we could find the correct clusters.

This paper focuses on two topics: (i) developing and analyzing a new criterion for clustering that has a scale-invariance property and (ii) incorporating methodology that simultaneously clusters and identifies outliers. We address these two issues via a new metric for clustering that is based on the volume computation of clusters. The volume of a cluster is defined as the volume of the minimum volume ellipsoid (MVE) enclosing all data points in the cluster. The goal is to minimize the sum of the volumes of these clusters. There are several ways to compute the volume of a set of data points (for example, using a convex hull, sphere, or rectangle). We chose the MVE in part because real-world data often exhibits a mixture of Gaussian distributions, which have equidensity contours in the shape of ellipsoids [7]. Further, representing clusters using ellipsoids provides a good visualization tool for the clusters [8]. We show that the new metric based on volume computation using the MVE is scale-invariant.

Given a data set to be clustered and a pre-specified value of ε, we permit an ε fraction of the data to be treated as outliers. The resulting problem is to find K clusters that include at least a 1 − ε fraction of the data and have the least total volume. In our empirical study, we found that this method of treating outliers performs well when the data comes from a mixture of Gaussian distributions. When there is only one Gaussian-distributed cluster, we show that the above method of treating outliers discovers the original data distribution asymptotically.

Researchers have shown [9,10] that it is not easy to compute the MVE enclosing a set of data points. The combinatorial nature of the clustering problem adds a further computational challenge. Using an approximate solution for the MVE developed by Sun and Freund [11], we present two heuristic algorithms, kVolume and hVolume, that are suitable for volume-based clustering. Using these algorithms, we are able to cluster one thousand data points in up to five dimensions in less than five minutes on a personal computer. We illustrate the effectiveness of volume-based clustering on a series of simulated data sets and contrast it to k-means, Ward's method, SOM, and model-based clustering.

The rest of the paper is organized as follows.
Section 2 summarizes previous work related to scale-invariant or volume-based clustering. Section 3 presents definitions and theoretical properties of the volume-based metric. Section 4 presents two heuristic algorithms, kVolume and hVolume, that attempt to minimize the volume-based metric. In Section 5, we present empirical study results on a series of simulated data sets. Finally, we provide a summary and future research directions in Section 6.


2. Literature review

The first elaborate work on scale-invariant clustering was done by Friedman and Rubin [12]. They used the Mahalanobis distance function [13], which is scale-invariant. Marriott [14] generalized the work of Friedman and Rubin to propose several clustering criteria, some of which are scale-invariant. On similar lines, Fraley and Raftery [15] developed a model-based clustering approach for Gaussian clusters whose clustering criterion is scale-invariant. Knorr et al. [3] discussed the problem of scaling in data mining problems, proposed a robust space transformation approach using the Donoho-Stahel estimator, and showed its application in outlier detection.

The problem of clustering using ellipsoids was first considered by Rosen [16] in the context of separating patterns using convex programming. Barnes [17] provided a heuristic algorithm for this problem using the concept of eigenvalue decomposition. Jolion et al. [8] and Tamez-Pena and Perez [18] showed that using the MVE as a basis for clustering produces robust clusters and showed its application in computer vision and image segmentation (also see [19] for a review of work on robust clustering). In all these papers, the authors faced serious computational challenges because it is not easy to compute the MVE for a set of data points. Although the MVE can be computed for a set of data points in polynomial time using semi-definite programming [10], in practice it takes a lot of computer time even for medium size data sets. SAS offers a subroutine for the MVE computation using an algorithm proposed by Rousseeuw and Leroy [9], but even this is not computationally fast enough to be used for clustering. Given the wide applicability of computing the MVE, a number of approximation algorithms have been developed recently (see [11] and references therein for a review). An approximate MVE for a set of data points is an ellipsoid that encloses all the points and whose volume is only slightly higher than the volume of the MVE. In our empirical studies, we have used the approximation algorithm developed by Sun and Freund [11] based on the notion of active sets.

3. Volume-based clustering metric

Given a set of n data points x_1, x_2, ..., x_n in R^d to be grouped into K clusters, and an ε between 0 and 1 that corresponds to the fraction of data to be treated as outliers, the objective of volume-based clustering is to minimize the sum of the volumes of the clusters, where the volume of a cluster is defined as the volume of the MVE enclosing all points in the cluster, and the outliers are identified as defined below.

Definition 1 (Outliers). Given a set of data points S and an upper bound ε > 0 on the fraction of data to be treated as outliers, the outliers form a subset O of the data such that |O| ≤ ε|S| and the volume-based clustering obtains the least total cluster volume on the remaining data, S \ O.

Let vol(C) denote the volume of the MVE enclosing all points in a cluster C. Volume-based clustering is then formally defined as

    \min \sum_{i=1}^{K} \mathrm{vol}(C_i)
    \text{s.t.} \quad \sum_{i=1}^{K} |C_i| \ge (1 - \varepsilon)\, n,
    \qquad C_1, C_2, \dots, C_K \subseteq \{x_1, \dots, x_n\}.    (2)
Here C_1, C_2, ..., C_K are mutually disjoint clusters. An ellipsoid is mathematically defined using its center c and a symmetric positive definite covariance matrix Q as

    E(c, Q) = \{\, x \mid (x - c)^t Q^{-1} (x - c) \le 1 \,\}.    (3)

The volume of E(c, Q) is equal to \det(Q)^{1/2} [10]. Thus, the MVE enclosing a cluster of points C is an ellipsoid E(c, Q) for which \det(Q)^{1/2} is minimized and for which the inequality (x - c)^t Q^{-1} (x - c) \le 1 holds for all x \in C.
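Since the MVE is the computational primitive used throughout the paper, the sketch below illustrates one standard way to approximate E(c, Q) for a finite point set. It uses Khachiyan's first-order algorithm rather than the Sun-Freund active-set method actually used by the authors; the function names and the tolerance parameter are ours.

```python
import numpy as np

def approx_mve(points, tol=1e-4, max_iter=1000):
    """Approximate minimum volume ellipsoid E(c, Q) enclosing the rows of `points`."""
    X = np.asarray(points, dtype=float)            # n x d data matrix
    n, d = X.shape
    P = np.vstack([X.T, np.ones(n)])               # lift points to dimension d+1
    u = np.full(n, 1.0 / n)                        # weights over the data points
    for _ in range(max_iter):
        V = (P * u) @ P.T                          # (d+1) x (d+1) weighted scatter
        M = np.einsum('ij,ij->j', P, np.linalg.solve(V, P))   # p_j^T V^{-1} p_j
        j = int(np.argmax(M))
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        if step <= tol:
            break
        u *= (1.0 - step)
        u[j] += step
    c = X.T @ u                                    # ellipsoid center
    Q = d * ((X.T * u) @ X - np.outer(c, c))       # shape matrix: (x-c)^T Q^{-1} (x-c) <= 1
    return c, Q

def mve_volume(Q):
    """Cluster volume as used in the paper: det(Q)^(1/2), dropping the unit-ball constant."""
    return float(np.sqrt(np.linalg.det(Q)))
```

For a cluster C stored as an array of row vectors, vol(C) in (2) can then be taken as mve_volume(Q) for the Q returned above.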


Using this definition of the MVE, the formulation in (2) can be rewritten as

    \min \sum_{k=1}^{K} \det(Q_k)^{1/2}
    \text{s.t.} \quad (x_i - c_k)^t Q_k^{-1} (x_i - c_k) \le 1 \quad \forall x_i \in C_k,\ \forall C_k,
    \qquad \sum_{k=1}^{K} |C_k| \ge (1 - \varepsilon)\, n,
    \qquad C_1, C_2, \dots, C_K \subseteq \{x_1, \dots, x_n\}, \quad Q_k \in \Omega,\ k = 1, \dots, K,    (4)

where \Omega is the set of symmetric positive definite matrices.

Lemma 1. Let f : x \mapsto Ax + u be an affine transformation of the data space. Then for any cluster of data points C,

    \mathrm{vol}(f(C)) = \mathrm{vol}(C)\, |\det(A)|.    (5)

Proof. Let E(c*, Q*) be the MVE for cluster C in the untransformed space, that is, c* and Q* form an optimal solution of the optimization problem

    \min \ \det(Q)^{1/2}
    \text{s.t.} \quad (x - c)^t Q^{-1} (x - c) \le 1 \quad \forall x \in C, \qquad Q \in \Omega,    (6)

and vol(C) = \det(Q^*)^{1/2}. The MVE computation problem in the transformed space becomes

    \min \ \det(Q)^{1/2}
    \text{s.t.} \quad (Ax + u - c)^t Q^{-1} (Ax + u - c) \le 1 \quad \forall x \in C, \qquad Q \in \Omega.    (7)

It is easy to verify that E(Ac* + u, A Q* A^t) is an optimal solution of the problem in (7). Therefore,

    \mathrm{vol}(f(C)) = \det(A Q^* A^t)^{1/2} = \det(Q^*)^{1/2} |\det(A)| = \mathrm{vol}(C)\, |\det(A)|. □

Theorem 1. The volume-based metric is invariant under an affine transformation of the data space.

Proof. Let f : x \mapsto Ax + u be any affine transformation of the data space. The optimal clustering in the transformed space is given by

    \min \sum_{i=1}^{K} \mathrm{vol}(f(C_i))
    \text{s.t.} \quad \sum_{i=1}^{K} |C_i| \ge (1 - \varepsilon)\, n,
    \qquad C_1, C_2, \dots, C_K \subseteq \{x_1, \dots, x_n\}.    (8)

From Lemma 1, \sum_{i=1}^{K} \mathrm{vol}(f(C_i)) = |\det(A)| \sum_{i=1}^{K} \mathrm{vol}(C_i). Since \det(A) is a constant, the optimal clusters in the formulation in (8) are the same as the optimal clusters in the untransformed space according to the formulation in (2). □
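Lemma 1 and Theorem 1 can also be checked numerically. The snippet below, which reuses the hypothetical approx_mve and mve_volume helpers sketched above (not the authors' code), applies a random affine map and compares the ratio of approximate MVE volumes with |det(A)|; the two agree up to the approximation tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(60, 3))              # a cluster of 60 points in R^3
A = rng.normal(size=(3, 3))               # a random linear map (invertible almost surely)
u = rng.normal(size=3)
fC = C @ A.T + u                          # affine image f(x) = Ax + u of the cluster

_, Q = approx_mve(C)                      # helpers from the earlier sketch
_, Qf = approx_mve(fC)
print(mve_volume(Qf) / mve_volume(Q))     # approximately equal to |det(A)|
print(abs(np.linalg.det(A)))
```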


3.1. Outliers treatment

It is important to note that volume-based clustering incorporates the concept of outliers in its objective function; thus it facilitates algorithms that identify clusters and outliers simultaneously. We show that when the data consists of a single Gaussian distributed cluster, the proposed method of identifying outliers is able to discover the correct data distribution asymptotically. Let us assume that the data comes from a single multivariate Gaussian distribution N(\mu, \Sigma), where \mu is the mean of the distribution and \Sigma is its covariance matrix.

Theorem 2. For a given \varepsilon > 0, as n \to \infty, the MVE enclosing at least a 1 - \varepsilon fraction of data coming from N(\mu, \Sigma) is given by the ellipsoid E(\mu, r\Sigma) for some constant r.

Proof. As n \to \infty, the fraction of data enclosed in a region is equal to the cumulative density of the data distribution in that region. That means the theorem is equivalent to saying that the MVE that encloses at least 1 - \varepsilon cumulative density of N(\mu, \Sigma) is given by an ellipsoid E(\mu, r\Sigma) for some constant r. Let us choose an r such that the cumulative density of the region inside E = E(\mu, r\Sigma) is equal to 1 - \varepsilon. To the contrary, let us assume that there exists an ellipsoid E' \ne E that has cumulative density of at least 1 - \varepsilon and whose volume is smaller than the volume of E. This implies that

    \int_{E' \setminus E} dx < \int_{E \setminus E'} dx.    (9)

The density function of N(\mu, \Sigma) is given by f(x) = (2\pi)^{-d/2} |\Sigma|^{-1/2} e^{-\frac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu)}. From the definition of E = E(\mu, r\Sigma) in (3), it follows that for all points inside E, f(x) \ge (2\pi)^{-d/2} |\Sigma|^{-1/2} e^{-r/2} = c, where c is a constant, and for all points outside E, f(x) < c. This implies the following:

    \int_{E'} f(x)\, dx = \int_{E \cap E'} f(x)\, dx + \int_{E' \setminus E} f(x)\, dx
                        < \int_{E \cap E'} f(x)\, dx + \int_{E' \setminus E} c\, dx
                        < \int_{E \cap E'} f(x)\, dx + \int_{E \setminus E'} c\, dx
                        \le \int_{E \cap E'} f(x)\, dx + \int_{E \setminus E'} f(x)\, dx
                        = \int_{E} f(x)\, dx = 1 - \varepsilon,    (10)

which contradicts our assumption that E' has cumulative density of at least 1 - \varepsilon, and hence the result follows. □

Doing a similar analysis for more than one cluster is not easy, and the natural extension of Theorem 2 to two clusters is not true. In Section 5, we present results from a series of empirical studies on two and three clusters that show that allowing a constant fraction of outliers improves the quality of clusters substantially over the case when outliers are not accounted for.

4. Volume-based clustering algorithms

In this section we propose two algorithms that attempt to minimize the total volume of clusters: (1) kVolume, an iterative algorithm that partitions the data into a specified number of clusters, and (2) hVolume, a hierarchical clustering algorithm that produces a nested sequence of clusters. In both algorithms, we use the approximate MVE algorithm developed by Sun and Freund [11].

4.1. kVolume: an iterative algorithm

The kVolume algorithm starts with an initial clustering into K clusters and makes incremental improvements in total cluster volume until no further improvement is possible. It is a modification of the k-means algorithm.


4.1.1. Initial clustering

There are several algorithms available in the literature for the initial partitioning of a data set [20]. A drawback of a majority of these algorithms is that the initial partition of the data usually depends on the units of data measurement. In order to maintain the scale-invariance property of the final clusters, we must have an initial partition that has small total volume and at the same time is independent of the units of measurement. We propose the InitClust algorithm for finding an initial clustering for the kVolume algorithm.

The InitClust algorithm starts with a single cluster consisting of all data points. The algorithm performs K − 1 division steps, each of which increases the number of clusters by one. At each step, the algorithm selects an existing cluster randomly and divides it into two clusters as described below. It randomly selects d data points from the selected cluster and computes a hyperplane passing through these points. Then the algorithm partitions the selected cluster into two sub-clusters, one on each side of this hyperplane. The InitClust algorithm is formally described in Algorithm 1.

Algorithm 1 (InitClust(x_1, ..., x_n; K)).
1: NumClust = 1
2: C_1 = {x_1, ..., x_n}
3: while NumClust < K do
4:   Select a cluster C_i randomly, 1 ≤ i ≤ NumClust
5:   Select d data points randomly from C_i
6:   Calculate a hyperplane H passing through all d points
7:   Divide C_i into two sub-clusters such that the sub-clusters lie on either side of H; points that lie on the hyperplane can be randomly assigned to either sub-cluster
8:   NumClust = NumClust + 1

4.1.2. kVolume algorithm

The kVolume algorithm is an iterative algorithm that starts with an initial partition of the data into K clusters and cycles through the following three steps.

• Step 1: For the current set of K clusters, compute the MVE E(c_k, Q_k) for clusters k = 1, ..., K.
• Step 2: Compute d_i, the distance of x_i to its closest cluster according to the distance function in (11). Sort the x_i in decreasing order of d_i and label the first ε fraction of the data as outliers.
• Step 3: Reassign each of the remaining data points to its closest cluster.

The distance of a data point x_i from cluster C_k is given by

    d(x_i, C_k) = (x_i - c_k)^t Q_k^{-1} (x_i - c_k).    (11)

Note that d(x, C) ≤ 1 if and only if x is inside the MVE of cluster C. This distance function is popularly known as the Mahalanobis distance, which has the scale-invariance property. We have chosen this distance function because it guarantees a decrease in the objective function in each iteration of kVolume, as shown below.

Lemma 2. The total cluster volume decreases monotonically in successive iterations of the kVolume algorithm.

Proof. Consider a cluster C_i that changed to C_i' after an iteration of kVolume. Let us consider a point x that moved from some other cluster (say C_j) to C_i'. This implies that d(x, C_i) < d(x, C_j). Further, we know that d(x, C_j) ≤ 1 because x was initially inside C_j. The two inequalities together imply that d(x, C_i) < 1, which means that x was already inside the MVE of C_i before its movement. The above argument implies that the MVE of C_i encloses all points in C_i'. Therefore, the volume of the MVE of C_i' is no more than the volume of the MVE of C_i, and hence the result follows. □

Proposition 1. The kVolume algorithm converges in a finite number of iterations.

Proof. The proposition follows from Lemma 2 and the fact that there are only a finite number of different partitions of a set of data points. □


The kVolume algorithm is formally described in Algorithm 2; a code sketch is given at the end of this subsection.

Algorithm 2 (kVolume(x_1, ..., x_n; K; ε)).
1: Get an initial clustering using InitClust(x_1, ..., x_n; K)
2: Compute E(c_k, Q_k) for all clusters C_k, k = 1, ..., K
3: Compute d(x_i, C_k) for each data-cluster pair
4: d_i = min_{C_k} d(x_i, C_k)
5: Sort the d_i in decreasing order and label the first εn data points as outliers
6: Reassign each of the remaining n − εn data points to its closest cluster
7: if clusters changed then
8:   go to Step 2

A drawback of the kVolume algorithm is that the final clusters may depend on the initial partition. It can also produce empty clusters if all points in a cluster are reassigned to other clusters, thereby reducing the number of clusters. The k-means algorithm also has these shortcomings [20]. We propose the following approach, which is similar to the one often used with k-means: run the kVolume algorithm a large number of times with different random initial partitions and pick the solution that has the least total volume, ignoring solutions that contain one or more empty clusters. Pena et al. [20] have shown that, if k-means is run a large number of times, the resulting clusters will be close to optimal and insensitive to the initial partition. In our empirical studies, we found that this is also true for kVolume when the algorithm was run 100 times with different initial partitions.

The time complexity of kVolume is O(KI(d^3 n + M)), where d is the dimension of the data, n is the number of data points, K is the number of clusters, I is the number of iterations the algorithm makes, and M is the time it takes to compute the MVE for n data points in R^d. In our empirical studies, we found that the algorithm generally converges after a few (typically fewer than 20) iterations. Further, the algorithm computes the volumes of the clusters only once, at the beginning of each iteration. The computation time is unaffected by ε, the fraction of outliers in the data. On a standard PC, kVolume is able to cluster 1000 data points in five dimensions in less than five minutes.
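The following is a minimal, hedged sketch of Algorithms 1 and 2 combined, reusing the approx_mve helper from Section 3. The function names, the NumPy-based hyperplane construction in init_clust, and the convergence test are ours, not the authors' implementation; empty or very small clusters and degenerate MVEs are not handled.

```python
import numpy as np

def init_clust(X, K, rng=None):
    """Algorithm 1 (InitClust): split the index set by K - 1 random hyperplane cuts."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    clusters = [np.arange(n)]
    while len(clusters) < K:
        i = int(rng.choice([j for j, c in enumerate(clusters) if len(c) > d]))
        idx = clusters.pop(i)                                # cluster chosen for division
        pts = X[rng.choice(idx, size=d, replace=False)]      # d random points span a hyperplane
        # Normal vector of the hyperplane through the d points (null space of the differences).
        w = np.linalg.svd(pts[1:] - pts[0])[2][-1] if d > 1 else np.ones(1)
        side = (X[idx] - pts[0]) @ w > 0
        clusters += [idx[side], idx[~side]]
    return clusters

def k_volume(X, K, eps=0.05, rng=None, max_iter=100):
    """Algorithm 2 (kVolume): iterative reassignment with an eps fraction of outliers."""
    n, _ = X.shape
    clusters = init_clust(X, K, rng)
    for _ in range(max_iter):
        # Step 1: approximate MVE of every current cluster (approx_mve from the Section 3 sketch).
        ellipsoids = [approx_mve(X[idx]) for idx in clusters]
        # Distance (11) of every point to every cluster ellipsoid.
        D = np.empty((n, K))
        for k, (c, Q) in enumerate(ellipsoids):
            diff = X - c
            D[:, k] = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Q), diff)
        # Step 2: label the eps*n points farthest from their closest cluster as outliers.
        d_min = D.min(axis=1)
        keep = np.argsort(d_min)[: n - int(eps * n)]
        # Step 3: reassign the remaining points to their closest clusters.
        labels = D[keep].argmin(axis=1)
        new_clusters = [keep[labels == k] for k in range(K)]
        if [set(c.tolist()) for c in new_clusters] == [set(c.tolist()) for c in clusters]:
            break
        clusters = new_clusters
    return clusters
```

In practice, as described above, this would be run from many (e.g., 100) random initial partitions, keeping the lowest-volume solution without empty clusters.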
4.2. hVolume: a hierarchical algorithm

Next, we develop a hierarchical greedy heuristic that we call the hVolume algorithm. The hVolume algorithm is a modification of Ward's agglomerative hierarchical clustering algorithm [2]. The key idea in most hierarchical clustering algorithms is to start with n singleton clusters (one cluster for each data point) and then, at each step of the algorithm, merge the pair of clusters that leads to the minimum increase in the objective function. One difficulty in using a standard hierarchical algorithm for volume-based clustering is that when we merge any two singleton clusters, the increase in total volume is zero because the volume of a cluster consisting of only two data points is zero. This may lead to a merger of two data points that are very far apart. We deal with this difficulty by modifying the standard hierarchical algorithm as follows. Instead of starting with n singleton clusters, we start with G > K clusters where each starting cluster has at least d + 1 points and positive volume. In our implementation of hVolume, we take G = n/(d + 1) and use kVolume(x, G, 0) to obtain the starting clusters. Starting with G > K initial clusters, each step of the hVolume algorithm combines the pair of clusters that leads to the minimum increase (or maximum decrease) in the total volume of the resulting clusters. Thus each step of the algorithm reduces the number of clusters by one. The merging process continues until we have K clusters left. The algorithm is formally described in Algorithm 3, and a code sketch follows.

Algorithm 3 (hVolume(x_1, ..., x_n; K; ε)).
1: Initialization: G = n/(d + 1); find G clusters, C_1, ..., C_G, using kVolume(x, G, 0)
2: Δ(i, j) = vol(C_i ∪ C_j, ε) − vol(C_i, ε) − vol(C_j, ε), for 1 ≤ i < j ≤ G, where vol(C, ε) is the volume of the MVE enclosing at least a 1 − ε fraction of the points of cluster C
3: while G > K do
4:   Merge the pair of clusters C_i and C_j for which Δ(i, j) is minimized
5:   G = G − 1
6:   Recalculate Δ(i, j) for 1 ≤ i < j ≤ G
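A hedged sketch of the merge loop is shown below. It reuses approx_mve, mve_volume, and k_volume from the earlier sketches; vol(C, ε) is implemented here by the simple heuristic of peeling off the ⌈ε|C|⌉ points farthest from the fitted ellipsoid and refitting, which is our interpretation rather than the authors' exact procedure, and the Δ values are recomputed at every step for clarity instead of being cached.

```python
import numpy as np
from itertools import combinations

def vol_eps(X, idx, eps):
    """Volume of an MVE covering at least a (1 - eps) fraction of the points in idx."""
    c, Q = approx_mve(X[idx])
    drop = int(np.ceil(eps * len(idx)))
    if drop > 0:
        diff = X[idx] - c
        dist = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Q), diff)
        idx = idx[np.argsort(dist)[: len(idx) - drop]]   # peel the farthest points, then refit
        _, Q = approx_mve(X[idx])
    return mve_volume(Q)

def h_volume(X, K, eps=0.05, rng=None):
    """Algorithm 3 (hVolume): greedy merging starting from G = n // (d + 1) kVolume clusters."""
    n, d = X.shape
    G = n // (d + 1)
    clusters = [np.asarray(c) for c in k_volume(X, G, eps=0.0, rng=rng)]
    while len(clusters) > K:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: vol_eps(X, np.concatenate([clusters[ij[0]], clusters[ij[1]]]), eps)
                           - vol_eps(X, clusters[ij[0]], eps)
                           - vol_eps(X, clusters[ij[1]], eps),
        )
        merged = np.concatenate([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```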


The hVolume algorithm performs the MVE computation O(n^2 / d^2) times and is, therefore, slower than the kVolume algorithm. An advantage of using hVolume is that it produces a set of nested clusters for each value of K. This is especially valuable when the number of clusters is not specified as part of the input.

5. Empirical study

In this section, we present a comparative study of volume-based clustering against a number of well-known clustering methods on a series of Gaussian distributed simulated data sets. We compared clustering results using the kVolume and hVolume algorithms against four other clustering methods: k-means, Ward's method, SOM, and model-based clustering. We implemented k-means and Ward's methods ourselves and used professional MATLAB packages for SOM [21] and model-based clustering [15,22]. The performance of a clustering algorithm is measured using its cluster accuracy, which is defined as the percentage of data identified with the correct cluster membership [21]. The outliers are not counted while computing cluster accuracy. Note that while outliers for kVolume and hVolume were identified as proposed in this paper, the outliers for the other clustering methods were identified at the end of the algorithm based on their Euclidean distance from the closest cluster center. We found that the cluster accuracy is significantly higher for kVolume and hVolume than for the other clustering methods. We also found that kVolume generally performed slightly better than the hVolume algorithm. A likely reason for this is that we ran kVolume with 100 different random initial partitions and picked the one that achieved the lowest total volume, which helped kVolume achieve better clusters than hVolume.

5.1. Generating simulated data

We generated a series of simulated data sets for the case of two and three clusters in two, three, and four dimensions using the multivariate Gaussian distribution function in MATLAB. The data generation process is similar to the one proposed in [21,23,24]. The main difference between the data used in [21,23,24] and the data used in this study is that these papers use diagonal covariance matrices to generate Gaussian distributed data, while we use general covariance matrices, which are obtained by multiplying the diagonal covariance matrices by randomly generated correlation matrices obtained using the randcorr function in MATLAB.

A total of 54 data sets are generated. The first six data sets, which we call the "base" data, are generated one for each case of 2 or 3 clusters in 2, 3, or 4 dimensions. Then 48 additional data sets are generated by modifying the six base data sets in one of the following ways: (1) changing the density level of the first cluster to 10% or 60% of the total population, (2) adding 10% or 20% of outliers to each cluster, (3) including 1 or 2 irrelevant variables, and (4) creating minor or major differences in the dispersion level of the clusters. The data set design is summarized in Table 1.

The six base data sets are generated as follows. Fifty points are generated for each cluster from a multivariate Gaussian distribution. For each cluster, a range value is randomly generated for each dimension of space using a uniform distribution of 10-40 units. The center of the cluster is the midpoint of this range.
The covariance matrix for the Gaussian distribution that is used to generate data for a cluster is given by (S * S) * R, where S is a diagonal matrix consisting of the standard deviation on each dimension (thus S * S is a diagonal matrix consisting of the variance on each dimension) and R is a randomly generated correlation matrix obtained using the randcorr function in MATLAB. Each diagonal entry of the S matrix is obtained by multiplying the range on that dimension by a dispersion multiplier. Three values of the dispersion multiplier are used in this experiment: 1/12 (low), 1/6 (medium), and 1/3 (high), where a higher value leads to a higher standard deviation. The clusters are placed in such a way that their centers are separated by a random quantity f * (s_i + s_j), where s_i is the standard deviation for cluster i on the first dimension and f is a random value generated from a uniform distribution of 1-3 units.

From the six base data sets, 12 additional data sets are generated by altering the uniform density (50 points per cluster) of the base data in two ways. For the 10% cluster density, the number of data points in the first cluster is reduced to 10% of the total population, and the remaining points are equally distributed among the remaining clusters. For the 60% cluster density, the same procedure is applied with the first cluster containing 60% of the total population.

Another 12 data sets are generated by adding two different levels of outliers to the six base data sets. At the 10% outlier level, five outlier points are added to each cluster; these are Gaussian distributed about the cluster center with a covariance matrix that is nine times the covariance of the remaining points in the cluster. At the 20% outlier level, 10 points are added as outliers to each cluster.
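A hedged NumPy sketch of the base-cluster generator and the outlier scheme described above is given below. scipy.stats.random_correlation stands in for MATLAB's randcorr, the covariance is formed in the symmetric form S R S (our rendering of the (S*S)*R construction above so that the sampler receives a valid covariance), and the function names, eigenvalue choice for the random correlation, and outlier helper are ours.

```python
import numpy as np
from scipy.stats import random_correlation

def make_cluster(d, n_points=50, dispersion=1/6, rng=None):
    """One Gaussian cluster: random per-dimension range and random correlation structure."""
    rng = np.random.default_rng(rng)
    ranges = rng.uniform(10, 40, size=d)            # range of 10-40 units per dimension
    center = ranges / 2.0                           # cluster center at the midpoint of the range
    stds = dispersion * ranges                      # S: std dev = dispersion multiplier * range
    eig = rng.uniform(0.5, 1.5, size=d)
    eig *= d / eig.sum()                            # random_correlation needs eigenvalues summing to d
    R = random_correlation.rvs(eig, random_state=rng)
    cov = np.diag(stds) @ R @ np.diag(stds)         # symmetric version of the (S*S)*R construction
    return rng.multivariate_normal(center, cov, size=n_points), center, cov

def add_outliers(points, center, cov, n_outliers=5, rng=None):
    """Outliers: Gaussian about the cluster center with nine times the cluster covariance."""
    rng = np.random.default_rng(rng)
    return np.vstack([points, rng.multivariate_normal(center, 9.0 * cov, size=n_outliers)])
```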

Table 1
Data set design

Data set                  Design factors                                      # Data sets
Base data                 2 or 3 clusters; 2, 3, or 4 dimensions              6
Cluster density           10% and 60% density levels                          12
Outliers                  10% and 20% of outliers                             12
Irrelevant information    1 or 2 irrelevant variables                         12
Dispersion difference     Major and minor levels of dispersion difference     12

Table 2
Cluster accuracy for the base data at ε = 0

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              74.1   77.4   76.6   58.8   56.7   58.7   67.1
Ward                 75.6   74.6   78.4   64.3   57.3   61.4   68.6
SOM                  83.3   86.2   85.5   78.9   73.8   75.2   80.5
Model based          90.8   90.8   92.9   81.4   82.9   83.8   87.1
hVolume              88.9   89.7   91.3   84.1   82.2   81.6   86.3
kVolume              89.0   92.1   92.0   85.6   86.2   83.0   88.0

Table 3
Cluster accuracy for the base data at ε = 0.05

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              75.5   79.5   78.6   60.3   53.9   59.6   68.8
Ward                 76.8   79.1   77.9   65.2   60.6   61.9   70.7
SOM                  84.0   88.6   86.8   80.1   76.7   75.2   82.1
Model based          91.3   92.7   93.0   84.5   82.1   86.2   88.7
hVolume              95.8   96.7   98.5   90.3   89.5   95.0   94.8
kVolume              96.4   97.2   98.7   93.0   90.9   98.6   96.2

These outliers are not counted in the calculation of cluster accuracy for any clustering method.

Another 12 data sets are generated by adding two levels of irrelevant information (1 or 2 irrelevant variables) to the base data. These irrelevant variables are uniformly distributed across the data space and carry no cluster membership information.

Finally, 12 additional data sets are generated by creating differences in the dispersion levels of the clusters in the base data. For a minor dispersion difference, a dispersion multiplier of 1/12 (low) is used for the first cluster and 1/6 (medium) is used for the remaining clusters. For a major dispersion difference, a dispersion multiplier of 1/12 is used for the first cluster and 1/3 (high) is used for the remaining clusters. This step creates clusters of different sizes.

5.2. Experimental results

The first experiment was conducted on the six base data sets to measure the effect of allowing outliers in the various clustering methods, the concept discussed in Section 3.1. We considered three choices of ε, the fraction of data labeled as outliers: 0%, 5%, and 10%. Each scenario of the experiment was run 100 times, and the average cluster accuracy for the various clustering methods at the various levels of ε is reported in Tables 2-4. In these tables, the label at the top of each column (for example, 2–2, 2–3, or 3–4) gives the number of clusters followed by the number of dimensions; for example, "2–3" refers to two clusters in three-dimensional space.

We notice that the cluster accuracy of all methods improved after allowing for a positive fraction of outliers. The improvement is much bigger for volume-based clustering than for the other methods.

Table 4
Cluster accuracy for the base data at ε = 0.10

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              75.3   78.7   79.3   61.0   54.2   59.8   68.1
Ward                 77.5   78.6   78.8   65.7   62.2   62.4   70.9
SOM                  85.1   89.3   88.0   80.6   77.3   75.0   82.6
Model based          91.8   92.6   91.6   85.0   84.2   85.6   88.5
hVolume              96.1   96.3   97.8   92.4   90.2   93.9   94.5
kVolume              96.2   95.3   97.0   92.5   91.8   96.4   94.9

Table 5
Cluster accuracy for 10% density level

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              61.2   59.7   62.8   47.6   50.1   52.8   58.3
Ward                 60.0   67.3   59.9   52.1   55.7   56.5   60.6
SOM                  76.3   78.4   75.4   67.5   66.7   70.0   74.0
Model based          81.8   86.2   82.9   79.3   76.4   77.2   82.1
hVolume              88.1   92.9   91.1   84.3   86.8   89.3   90.0
kVolume              90.2   91.3   93.2   87.6   91.6   88.6   91.6

Table 6
Cluster accuracy for 60% density level

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              70.2   71.1   76.5   60.0   47.2   53.8   64.7
Ward                 67.3   68.3   72.4   57.5   52.6   58.4   64.3
SOM                  82.6   87.6   83.7   78.2   79.3   76.3   81.6
Model based          88.8   90.8   91.1   85.6   83.9   84.5   87.9
hVolume              93.3   95.2   96.6   89.1   90.1   92.7   93.5
kVolume              92.0   94.7   97.2   90.9   93.0   91.3   93.9

Table 7
Cluster accuracy for 10% outliers

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              58.1   54.3   59.8   43.3   48.6   50.2   55.5
Ward                 60.4   57.1   56.2   45.6   52.3   49.5   56.4
SOM                  71.8   73.5   70.4   63.8   64.7   60.8   69.8
Model based          77.4   76.7   72.6   68.3   69.4   70.0   75.0
hVolume              85.2   82.8   84.7   79.1   76.5   75.9   83.1
kVolume              84.6   80.0   86.3   81.2   78.8   75.5   83.6

We also note that there is no significant difference in cluster accuracy between the cases ε = 0.05 and ε = 0.10. Therefore, in the remainder of this section, we present experimental results using only ε = 0.05.

Results on the other 48 data sets are presented in Tables 5-12. We find that, in all these experiments, volume-based clustering performed significantly better than the other clustering methods. In general, k-means and Ward's methods performed the worst, followed by SOM, then model-based clustering, and then volume-based clustering. The key reason why model-based and volume-based clustering performed better than the other methods is that these methods are designed for Gaussian distributed data (with general covariance matrices), whereas k-means is suitable for spherical clusters and SOM is suitable for elliptical clusters aligned along the axes (diagonal covariance matrices).

Table 8
Cluster accuracy for 20% outliers

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              55.4   57.0   60.7   40.1   42.5   46.3   53.7
Ward                 54.2   55.5   59.3   42.8   47.2   49.6   54.6
SOM                  67.8   70.1   68.9   60.3   61.9   57.5   67.2
Model based          70.3   72.7   73.5   65.7   68.4   69.1   72.9
hVolume              81.8   80.9   81.0   75.1   76.2   74.8   81.1
kVolume              81.1   79.2   83.3   78.0   76.8   77.4   82.0

Table 9
Cluster accuracy for one irrelevant variable

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              66.2   68.8   65.5   60.3   58.4   56.6   64.3
Ward                 65.7   65.9   68.6   62.2   59.1   59.3   64.9
SOM                  76.4   75.4   79.3   71.8   72.6   70.9   75.7
Model based          80.1   83.7   85.8   77.5   78.5   78.7   82.2
hVolume              89.0   91.6   93.0   85.0   88.9   87.4   90.4
kVolume              90.5   91.1   94.1   89.3   90.2   88.7   91.8

Table 10
Cluster accuracy for two irrelevant variables

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              62.5   61.0   63.1   58.3   54.5   53.1   60.9
Ward                 63.3   64.4   65.7   57.7   56.4   54.6   62.2
SOM                  73.2   74.7   75.9   67.4   65.7   68.5   72.7
Model based          79.9   80.8   82.4   75.1   77.3   72.8   79.9
hVolume              81.5   85.3   82.3   78.9   81.4   79.4   83.3
kVolume              83.1   86.6   87.0   79.1   82.2   79.8   84.5

Table 11
Cluster accuracy for minor difference in dispersion level

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              72.8   74.9   70.2   62.7   63.4   66.5   69.2
Ward                 73.2   75.3   62.3   60.3   63.3   65.7   67.7
SOM                  82.2   85.5   85.9   76.7   74.5   73.2   80.2
Model based          89.5   91.1   92.0   88.2   84.7   89.4   89.4
hVolume              98.6   97.7   98.1   93.1   95.2   94.8   96.4
kVolume              99.1   98.2   97.4   94.0   94.8   95.6   96.8

Table 12
Cluster accuracy for major difference in dispersion level

Clustering method    2–2    2–3    2–4    3–2    3–3    3–4    Average
k-means              65.6   62.1   69.5   53.8   55.5   47.0   61.1
Ward                 64.3   65.0   66.6   54.1   51.4   50.7   60.8
SOM                  78.8   75.5   77.3   69.2   66.2   68.6   74.2
Model based          95.4   96.8   93.7   91.5   90.0   92.3   93.4
hVolume              99.4   98.8   99.2   98.5   97.9   98.6   98.6
kVolume              99.5   99.5   99.7   99.1   99.7   99.2   99.3


Between model-based and volume-based clustering, when we did not allow for outliers in the data (ε = 0), both methods had similar cluster accuracy, but when a positive fraction of outliers was allowed, volume-based clustering outperformed model-based clustering.

6. Summary and future research

In this paper, we have proposed a new metric for clustering that is based on the MVE calculation of clusters. We showed that the new metric does not depend on the units of data measurement. We incorporated the concept of outliers in the new metric and showed that the proposed method of treating outliers recovers the data distribution asymptotically when the data comes from a single multivariate Gaussian distribution. We developed two clustering algorithms that simultaneously minimize the total cluster volume and identify outliers in the data. We demonstrated the effectiveness of the new clustering algorithms on a series of simulated data sets.

One drawback of the proposed clustering algorithms is that they take more computer time than most traditional clustering algorithms, primarily due to the computation time involved in finding the MVE. This limits the use of volume-based clustering to small and medium size data sets. By developing better MVE algorithms, by designing better clustering algorithms, or perhaps by clustering a sample of data points instead of the entire data set, one could make volume-based clustering usable for very large data sets. Another direction for future research would be to extend the theory and empirical studies presented in this paper to elliptically distributed data. Finally, it would be useful to characterize real-world problems where volume-based clustering works well.

Acknowledgments

The authors are grateful to Nitin R. Patel, the Center for E-business Research at MIT, and the Research Resource Committee at Rutgers Business School for their valuable suggestions and financial support.

References

[1] Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall; 1988.
[2] Ward JH. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 1963;58:236-44.
[3] Knorr E, Ng R, Zamar R. Robust space transformations for distance-based operations. In: Proceedings of the 7th international conference on knowledge discovery and data mining. 2001. p. 126-35.
[4] Barnett V, Lewis T. Outliers in statistical data. New York: Wiley; 1994.
[5] Hardin J, Rocke D. Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Statistics and Data Analysis 2004;44:625-38.
[6] Santos-Pereira CM, Pires AM. Detection of outliers in multivariate data: a method based on clustering and robust estimators. In: Proceedings of the 15th symposium in computational statistics. 2002. p. 291-6.
[7] Tong YL. The multivariate normal distribution. Berlin: Springer; 1990.
[8] Jolion J, Meer P, Bataouche S. Robust clustering with applications in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 1991;13(8):791-802.
[9] Rousseeuw PJ, Leroy AM. Robust regression and outlier detection. New York: Wiley; 1987.
[10] Vandenberghe L, Boyd S, Wu SP. Determinant maximization with linear matrix inequality constraints. SIAM Journal on Matrix Analysis and Applications 1998;19(2):499-533.
[11] Sun P, Freund RM. Computation of minimum-volume covering ellipsoids. Operations Research 2004;52(5):690-706.
[12] Friedman HP, Rubin J. On some invariant criteria for grouping data. Journal of the American Statistical Association 1967;62:1159-78.
[13] Mahalanobis PC. On the generalized distance in statistics. Proceedings of the National Institute of Science of India 1936;2:49-55.
[14] Marriott FHC. Optimization methods of cluster analysis. Biometrika 1982;69(2):417-21.
[15] Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 2002;97:611-31.
[16] Rosen JB. Pattern separation by convex programming. Journal of Mathematical Analysis and Applications 1965;10:123-34.
[17] Barnes ER. An algorithm for separating patterns by ellipsoids. IBM Journal of Research and Development 1982;26(6):759-64.
[18] Tamez-Pena JG, Perez A. Robust parallel clustering algorithm for image segmentation. In: Proceedings of the international society for optical engineering. 1996. p. 737-48.
[19] Dave RN, Krishnapuram R. Robust clustering methods: a unified view. IEEE Transactions on Fuzzy Systems 1997;5(2):270-93.
[20] Pena J, Lozano J, Larranaga P. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters 1999;20:1027-40.


[21] Mangiameli P, Chen SK, West D. A comparison of SOM neural network and hierarchical clustering methods. European Journal of Operational Research 1996;93:402-17.
[22] Martinez AR, Martinez WL. Model-based clustering toolbox for MATLAB. http://www.stat.washington.edu/fraley/mclust; 2004.
[23] Milligan GW. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika 1980;45(3):325-42.
[24] Milligan GW. An algorithm for generating artificial test clusters. Psychometrika 1985;50(1):123-7.