Pattern Recognition Letters 28 (2007) 1981–1986 www.elsevier.com/locate/patrec
Inappropriateness of the criterion of k-way normalized cuts for deciding the number of clusters

Ayumu Nagai *

Department of Computer Science, Gunma University, 1-5-1 Tenjin-cho, Kiryu, Gunma 376-8515, Japan

Received 16 February 2006; received in revised form 20 April 2007; available online 12 June 2007

Communicated by R.P.W. Duin

* Tel.: +81 277 30 1809; fax: +81 277 30 1801. E-mail address: [email protected]

doi:10.1016/j.patrec.2007.05.020
Abstract

Spectral clustering differs from other existing clustering algorithms in that it relies on a linear-algebraic approach, including spectral decomposition. Normalized Cuts is a representative algorithm of spectral clustering. It incorporates a criterion for deciding the number $k$ of clusters to partition. This paper shows that the criterion is not appropriate for deciding $k$. We show this by proving that the optimal bipartition (that is, when $k = 2$) becomes the optimal clustering; namely, under the criterion, the evaluation becomes better when $k$ is small. We also show that the criterion is inappropriate for comparing approximate solutions with various $k$. In particular, we prove that a bipartition which surpasses the best given approximate solution $\hat{H}_{\hat{k}}$ can be constructed from $\hat{H}_{\hat{k}}$ within time complexity at most $O(\hat{k}^3)$, where $\hat{k}$ is the number of clusters contained in $\hat{H}_{\hat{k}}$. For these two reasons, the Normalized Cuts criterion is not appropriate for deciding $k$. An alternative criterion is necessary.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Clustering; Spectral clustering; Number of clusters; Cluster validation
1. Introduction

Clustering algorithms, which are used for unsupervised classification, classify a given dataset into groups called clusters, intending to recognize sets of similar data as clusters. There are two major goals of clustering algorithms: (1) to find a proper partition and (2) to decide the number of clusters. Since a "proper" partition is difficult to define, much effort has been invested in the first goal. Whether directly or indirectly, some of these studies also deal with the second goal. In this paper, we mainly discuss the second goal, that is, the decision of the number of clusters.

Among various algorithms, we focus on an algorithm called spectral clustering (Dhillon, 2001; Ding et al., 2001;
Kannan et al., 2004; Ding and He, 2004). Spectral clustering is quite different from other algorithms in the sense that it uses spectral decomposition in order to obtain an approximate solution. The methods for spectral clustering are classified into several types depending on the objective function used. Among them, we focus on a representative method called Normalized Cuts (Shi and Malik, 2000). Normalized Cuts is mostly used for image segmentation (Malik et al., 2001; Yu and Shi, 2003; Cour et al., 2005). Besides, there are applications to biological data (Pentney and Meilă, 2005; Dhillon et al., 2004).

The idea of Normalized Cuts (Shi and Malik, 2000) is as follows. The problem is similar to the min-cut problem in the field of graph theory, since a cut of a graph is similar to a classification of the dataset. However, the minimal cut tends to be a cut that separates an outlier from the data. Such a cut is meaningless in actual cases, because the sizes of the clusters are exceedingly one-sided. Therefore, Shi and Malik
proposed a criterion which gives a good evaluation when the sizes of the clusters are balanced. This is the idea of Normalized Cuts.

When we regard every individual input datum as a node, and the similarity between two data as the weight on the edge between the corresponding nodes, we can consider an input of the problem as a weighted graph. Therefore, we assume that a weighted graph $G = (V, E, W)$ is given, where $V$ is the set of all nodes (corresponding to the input data), $E$ is the set of edges connecting the nodes, and $W$ is a similarity matrix which denotes the similarity between any two input data. The matrix $W$ is assumed to be a nonnegative, symmetric $n \times n$ matrix. Clustering the $n$ input data into $k$ clusters means grouping $V$ into $k$ disjoint sets, denoted $C_V^k \stackrel{\mathrm{def}}{=} \{V_1, V_2, \ldots, V_k\}$, which satisfy $\bigcup_{a=1}^{k} V_a = V$ and $V_a \cap V_b = \emptyset$ for all $a \neq b$. $C_V^k$ corresponds to a clustering. We simply define $V$, the set of $n$ input data, as $\{1, 2, \ldots, n\}$ in this paper.

Let $A, B \subseteq V$. We define links between two sets $A$ and $B$ by

$$\mathrm{links}(A, B) \stackrel{\mathrm{def}}{=} \sum_{i \in A} \sum_{j \in B} (W)_{ij} \qquad (1)$$

The quantity links is the same as cut, except that links from $A$ to $A$ itself is also defined. We define the degree of a set as the total links to all the nodes. That is,

$$\mathrm{degree}(A) \stackrel{\mathrm{def}}{=} \mathrm{links}(A, V) \qquad (2)$$

The linkratio between two sets is defined as follows; the degree is used for normalization:

$$\mathrm{linkratio}(A, B) \stackrel{\mathrm{def}}{=} \frac{\mathrm{links}(A, B)}{\mathrm{degree}(A)} \qquad (3)$$
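As a concrete illustration of Eqs. (1)–(3), the following minimal NumPy sketch (ours, not part of the original paper; the function names are our own) computes links, degree, and linkratio directly from a similarity matrix $W$:

```python
import numpy as np

def links(W, A, B):
    """Eq. (1): total similarity between node sets A and B."""
    return W[np.ix_(A, B)].sum()

def degree(W, A):
    """Eq. (2): links from A to the whole node set V."""
    return links(W, A, range(W.shape[0]))

def linkratio(W, A, B):
    """Eq. (3): links(A, B) normalized by degree(A)."""
    return links(W, A, B) / degree(W, A)

# Toy similarity matrix: two obvious clusters {0, 1} and {2, 3}
# joined by a weak "bridge" of weight 0.1.
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])

A, B = [0, 1], [2, 3]
print(links(W, A, B))      # 0.2: only the two bridge edges
print(linkratio(W, A, B))  # ~0.09: A keeps most of its links inside itself
```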
The linkratio between $A$ and $B$ is the proportion of the links to $B$ among the total links that $A$ has. Then, the formulation of Normalized Cuts is to minimize the following criterion:

$$F_k(C_V^k) \stackrel{\mathrm{def}}{=} \frac{1}{k} \sum_{a=1}^{k} \mathrm{linkratio}(V_a, V \setminus V_a) \qquad (4)$$

$\mathrm{linkratio}(V_a, V \setminus V_a)$ is the proportion of the links exiting a cluster $V_a$ among the total links in $V_a$. $F_k$, the criterion of Normalized Cuts, is the average linkratio over all clusters.

So far, we have assumed that $k$, i.e., the number of clusters, is given. In this case, since $k$ is fixed, only the partition $C_V^k$ has to be determined under the criterion $F_k$. However, depending on the problem, it is often the case that $k$ is not given. In this case, $k$ has to be determined as well as a partition $C_V^k$. This is a more difficult problem than one in which $k$ is given. As a matter of fact, the criterion $F_k$ is a representative criterion of Normalized Cuts for deciding $k$ in problems where $k$ is not given (Yu and Shi, 2003; Meilă and Xu, 2003; Dhillon et al., 2004; Yu and Shi, 2004).

In this paper, we show that the criterion $F_k$ is not appropriate for deciding $k$. In concrete terms, we prove that the optimal bipartition (i.e., $k = 2$) becomes the optimal (w.r.t. $k$) clustering based on the criterion $F_k$. $\bar{F}_k$, which denotes the evaluation of the optimal $k$-way partition, is a monotonically increasing function of $k$. In other words, $F_k$ is biased toward small $k$. Therefore, we cannot use the criterion $F_k$ in order to decide $k$.

2. Normalized Cuts Criterion

An objective function of Normalized Cuts is usually expressed in the form of linear algebra. We denote the partition $C_V^k$ by an $n \times k$ "partition matrix" $H$. Let $[h_1, h_2, \ldots, h_k] \stackrel{\mathrm{def}}{=} H$; that is, $h_a$ is a binary indicator vector for a cluster $V_a$. The $i$th element of $h_a$ is 1 if and only if the $i$th datum is a member of cluster $V_a$ (otherwise it is 0). Note that $\sum_{a=1}^{k} h_a = 1_n$, since each datum is a member of a single cluster. In other words,

$$(H)_{ia} \stackrel{\mathrm{def}}{=} \begin{cases} 1 & (i \in V_a) \\ 0 & (i \notin V_a) \end{cases} \quad (1 \le i \le n,\ 1 \le a \le k) \qquad (5)$$

We define an $n \times n$ degree matrix $D$ as follows:

$$(D)_{ij} \stackrel{\mathrm{def}}{=} \begin{cases} \sum_{l=1}^{n} (W)_{il} & (i = j) \\ 0 & (i \neq j) \end{cases} \quad (1 \le i, j \le n) \qquad (6)$$

$D$ is a diagonal matrix. Note that $W 1_n = D 1_n$. Then, links and degree can be denoted as follows:

$$\mathrm{links}(V_a, V_b) = h_a^T W h_b = h_b^T W h_a \qquad (7)$$

$$\mathrm{degree}(V_a) = h_a^T D h_a \qquad (8)$$

Then, the criterion $F_k$ of Normalized Cuts can be denoted as Eq. (9):

$$F_k \stackrel{\mathrm{def}}{=} \frac{1}{k} \sum_{a=1}^{k} \mathrm{linkratio}(V_a, V \setminus V_a) = \frac{1}{k} \sum_{a=1}^{k} \big( \mathrm{linkratio}(V_a, V) - \mathrm{linkratio}(V_a, V_a) \big) = 1 - \frac{1}{k} \sum_{a=1}^{k} \frac{h_a^T W h_a}{h_a^T D h_a} \qquad (9)$$
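To make the matrix form concrete, here is a short sketch (again ours, for illustration only) that evaluates $F_k$ through the indicator-vector expression of Eq. (9); by construction it agrees with averaging the linkratios of Eq. (4):

```python
import numpy as np

def ncut_criterion(W, labels):
    """F_k of Eq. (9): 1 - (1/k) * sum_a (h_a^T W h_a) / (h_a^T D h_a)."""
    D = np.diag(W.sum(axis=1))           # degree matrix, Eq. (6)
    clusters = np.unique(labels)
    total = 0.0
    for a in clusters:
        h = (labels == a).astype(float)  # indicator vector h_a, Eq. (5)
        total += (h @ W @ h) / (h @ D @ h)
    return 1.0 - total / len(clusters)

# Same toy W as before: the natural bipartition scores well (low F_2).
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
labels = np.array([0, 0, 1, 1])
print(ncut_criterion(W, labels))  # ~0.09: both clusters keep most links inside
```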
Normalized Cuts results in the minimization problem expressed by Eq. (9). $F_k(H)$ explicitly denotes the evaluation of a partition $H$ by the criterion $F_k$. $H_k$ explicitly denotes a $k$-way partition $H$. When $k$ is given, the criterion for Normalized Cuts is defined as follows:

$$\min_{H_k} \ F_k(H_k) \quad \text{s.t.} \quad H_k \in \{0, 1\}^{n \times k}, \quad \sum_{a=1}^{k} h_a = 1_n \qquad (10)$$
However, when $k$ is not given, $k$ has to be determined as well as a partition $H$. In that case, the problem is defined as (Yu and Shi, 2003, 2004; Meilă and Xu, 2003; Dhillon et al., 2004)

$$\min_{k} \min_{H_k} \ F_k(H_k) \quad \text{s.t.} \quad H_k \in \{0, 1\}^{n \times k}, \quad \sum_{a=1}^{k} h_a = 1_n \qquad (11)$$
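Eq. (11) can be probed by exhaustive search on a tiny dataset. The following brute-force sketch (ours, for illustration; it reuses `ncut_criterion` from the previous sketch) enumerates all partitions with exactly $k$ nonempty clusters and reports $\bar{F}_k = \min_{H_k} F_k(H_k)$ for each $k$; consistent with Theorem 3.1 below, the minimum over $k$ is attained at $k = 2$:

```python
from itertools import product
import numpy as np
# Reuses ncut_criterion() from the previous sketch.

def best_Fk(W, k):
    """Brute-force computation of min_{H_k} F_k(H_k).

    Feasible only for very small n, since there are k^n assignments."""
    n = W.shape[0]
    best = np.inf
    for assign in product(range(k), repeat=n):
        labels = np.array(assign)
        if len(np.unique(labels)) != k:  # require exactly k nonempty clusters
            continue
        best = min(best, ncut_criterion(W, labels))
    return best

W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
for k in range(2, 5):
    print(k, best_Fk(W, k))  # the optimum increases with k: k = 2 wins
```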
3. Theorem and its proof

What we want to prove in this paper is Theorem 3.1.

Theorem 3.1. For any given $k \ge 2$, $\min_k \min_{H_k} F_k(H_k) = \min_{H_2} F_2(H_2)$.

The claim of Theorem 3.1 is as follows. It is when $k = 2$ that the criterion $F_k$ is minimized. There may be another minimal solution when $k > 2$, but at least there exists an optimal solution when $k = 2$.

The strategy of our proof is as follows. First, we prove a proposition that, for any $k$, the optimal $k$-way partition surpasses (or is not worse than) the optimal $(k+1)$-way partition. Here, we show that, by merging some pair of clusters in the optimal $(k+1)$-way partition, we can construct a new $k$-way partition whose evaluation excels that of the original optimal $(k+1)$-way partition. Since this new $k$-way partition is obviously not better than the optimal $k$-way partition, the proposition holds. By applying this proposition iteratively, Theorem 3.1 is derived. That is, $F_k$ is optimal (w.r.t. $k$) when $k = 2$ (the minimum partition).

For the actual proof, we use another criterion $G_k$ instead of $F_k$. The criterion $G_k$ is defined as follows:

$$G_k(H_k) \stackrel{\mathrm{def}}{=} \frac{k}{k-1} F_k(H_k) \qquad (12)$$

$\bar{G}_k$ is defined as the evaluation of an optimal $k$-way partition under the criterion $G_k$. That is,

$$\bar{G}_k \stackrel{\mathrm{def}}{=} \min_{H_k} G_k(H_k) \qquad (13)$$

$\bar{F}_k$ is defined in a similar way, as $\bar{F}_k \stackrel{\mathrm{def}}{=} \min_{H_k} F_k(H_k)$. The optimal $k$-way partition $H_k^*$ is defined as follows:

$$H_k^* \stackrel{\mathrm{def}}{=} \arg\min_{H_k} F_k(H_k) \qquad (14)$$

Since $G_k(H_k)$ equals a constant times $F_k(H_k)$ for any $H_k$,

$$H_k^* = \arg\min_{H_k} G_k(H_k) \qquad (15)$$

Let $[h_1^*, h_2^*, \ldots, h_k^*] \stackrel{\mathrm{def}}{=} H_k^*$. Note that $\bar{G}_k = G_k(H_k^*)$ and $\bar{F}_k = F_k(H_k^*)$.

Lemma 3.2. For any given $k \ge 2$, $\bar{G}_{k+1} \ge \bar{G}_k$.

Lemma 3.2 claims that the optimal $k$-way partition surpasses (or is equal to) the optimal $(k+1)$-way partition under the criterion $G$.

Proof of Lemma 3.2. For any given $k$, suppose that the optimal $(k+1)$-way partition $H_{k+1}^*$, whose evaluation $\bar{G}_{k+1}$ is given by Eq. (16), is given:

$$\bar{G}_{k+1} \stackrel{\mathrm{def}}{=} \frac{k+1}{k} \bar{F}_{k+1} = \frac{1}{k} \left( k + 1 - \sum_{i=1}^{k+1} \frac{W_i}{D_i} \right) \qquad (16)$$

where $W_i \stackrel{\mathrm{def}}{=} h_i^T W h_i$, $D_i \stackrel{\mathrm{def}}{=} h_i^T D h_i$ ($1 \le i \le k+1$), and $H_{k+1}^* = [h_1, h_2, \ldots, h_{k+1}]$ is the optimal $(k+1)$-way partition (we drop the asterisks on the $h_i$ for readability). On the other hand, the evaluation of a $k$-way partition is given by Eq. (17):

$$G'_k \stackrel{\mathrm{def}}{=} \frac{k}{k-1} F'_k = \frac{1}{k-1} \left( k - \sum_{i=1}^{k} \frac{W'_i}{D'_i} \right) \qquad (17)$$

where $W'_i \stackrel{\mathrm{def}}{=} h_i'^T W h'_i$ and $D'_i \stackrel{\mathrm{def}}{=} h_i'^T D h'_i$ ($1 \le i \le k$). We consider a $k$-way partition which is obtained by merging two clusters $V_a$ and $V_b$ ($a < b$) selected from the $k+1$ clusters (see Fig. 1). That is, we only consider $h'_i$ defined as follows:

$$h'_i \stackrel{\mathrm{def}}{=} \begin{cases} h_a + h_b & (i = 1) \\ h_{i-1} & (2 \le i \le a) \\ h_i & (a < i < b) \\ h_{i+1} & (b \le i \le k) \end{cases} \qquad (18)$$

By Eq. (17),

$$G'_k = \frac{1}{k-1} \left( k - \left( \sum_{i=1}^{k+1} \frac{W_i}{D_i} - \frac{W_a}{D_a} - \frac{W_b}{D_b} + \frac{W'_1}{D'_1} \right) \right) \qquad (19)$$

By Eq. (16) and Eq. (19),

$$\bar{G}_{k+1} \ge G'_k \iff (k-1)\left( k + 1 - \sum_{i=1}^{k+1} \frac{W_i}{D_i} \right) \ge k \left( k - \left( \sum_{i=1}^{k+1} \frac{W_i}{D_i} - \frac{W_a}{D_a} - \frac{W_b}{D_b} + \frac{W'_1}{D'_1} \right) \right) \iff \sum_{i=1}^{k+1} \frac{W_i}{D_i} - 1 - k \left( \frac{W_a}{D_a} + \frac{W_b}{D_b} - \frac{W'_1}{D'_1} \right) \ge 0 \qquad (20)$$
Fig. 1. An example of a $(k+1)$-way partition when $k = 3$. A $k$-way partition is obtained by merging two clusters $V_a$ and $V_b$ ($a < b$).
We are going to show that there is a pair $a$ and $b$ ($1 \le a < b \le k+1$) which satisfies Eq. (20). To show this by contradiction, we first assume to the contrary that Eq. (20) is not satisfied for any pair $a$ and $b$. Then

$$\forall a, \forall b\ (a \neq b):\ (\text{left-hand side of Eq. (20)}) < 0$$

$$\iff \forall a, \forall b\ (a \neq b):\ (\text{left-hand side of Eq. (20)}) \cdot (D_a + D_b) < 0$$

$$\implies \sum_{1 \le a < b \le k+1} \big[ (\text{left-hand side of Eq. (20)}) \cdot (D_a + D_b) \big] < 0 \qquad (21)$$

We then show that the left-hand side of Eq. (21) is equal to 0, which results in a contradiction. To that end, we use Eq. (22) and Eq. (23):

$$\sum_{1 \le a < b \le k+1} \left( \frac{W_a}{D_a} + \frac{W_b}{D_b} \right)(D_a + D_b) = k \sum_{a=1}^{k+1} W_a + \sum_{a=1}^{k+1} \frac{W_a}{D_a}(D - D_a) = (k-1) \sum_{i=1}^{k+1} W_i + D \sum_{i=1}^{k+1} \frac{W_i}{D_i} \qquad (22)$$

where $D \stackrel{\mathrm{def}}{=} \sum_{i=1}^{k+1} D_i = 1_n^T D 1_n = 1_n^T W 1_n$, and

$$\sum_{1 \le a < b \le k+1} W'_1 = \sum_{1 \le a < b \le k+1} (h_a + h_b)^T W (h_a + h_b) = \sum_{1 \le a < b \le k+1} (W_a + W_b) + \sum_{1 \le a < b \le k+1} 2 h_a^T W h_b = k \sum_{i=1}^{k+1} W_i + \sum_{1 \le a < b \le k+1} 2 h_a^T W h_b \qquad (23)$$

Since $D$ is diagonal and $h_a^T D h_b = 0$ for $a \neq b$,

$$D'_1 = (h_a + h_b)^T D (h_a + h_b) = h_a^T D h_a + 2 h_a^T D h_b + h_b^T D h_b = D_a + D_b \qquad (24)$$

Noting also that $\sum_{1 \le a < b \le k+1} (D_a + D_b) = kD$, we have the following fact due to Eq. (22), Eq. (23), and Eq. (24):

$$(\text{left-hand side of Eq. (21)}) = \sum_{1 \le a < b \le k+1} \left[ \left( \sum_{i=1}^{k+1} \frac{W_i}{D_i} - 1 \right)(D_a + D_b) - k \left( \frac{W_a}{D_a} + \frac{W_b}{D_b} \right)(D_a + D_b) + k W'_1 \right]$$

$$= kD \sum_{i=1}^{k+1} \frac{W_i}{D_i} - kD - k(k-1) \sum_{i=1}^{k+1} W_i - kD \sum_{i=1}^{k+1} \frac{W_i}{D_i} + k^2 \sum_{i=1}^{k+1} W_i + k \sum_{1 \le a < b \le k+1} 2 h_a^T W h_b$$

$$= -kD + k \left( \sum_{1 \le a = b \le k+1} h_a^T W h_b + \sum_{1 \le a < b \le k+1} 2 h_a^T W h_b \right) = -kD + k \sum_{a=1}^{k+1} \sum_{b=1}^{k+1} h_a^T W h_b = -kD + k\, 1_n^T W 1_n = -kD + kD = 0 \qquad (25)$$

Since the left-hand side of Eq. (21) is equal to 0, Eq. (21) is a contradiction. Therefore, there always exists a pair of clusters $V_a$ and $V_b$ which satisfies Eq. (20). Thus, $\bar{G}_{k+1} \ge G'_k$ holds, where $G'_k$ is the evaluation of the new $k$-way partition constructed by merging the two clusters. On the other hand, $G'_k \ge \bar{G}_k$ obviously holds; that is, this new $k$-way partition is not better than the optimal $k$-way partition. Hence, $\bar{G}_{k+1} \ge G'_k \ge \bar{G}_k$. □

Lemma 3.3. For any given $k \ge 2$, $\bar{F}_{k+1} \ge \bar{F}_k$.

Proof of Lemma 3.3. It follows easily from Lemma 3.2:

$$\bar{F}_{k+1} - \bar{F}_k = \frac{k}{k+1} \bar{G}_{k+1} - \frac{k-1}{k} \bar{G}_k = \frac{k}{k+1} \bar{G}_{k+1} - \frac{k^2-1}{k^2} \cdot \frac{k}{k+1} \bar{G}_k \ge \frac{k}{k+1} \left( \bar{G}_{k+1} - \bar{G}_k \right) \ge 0 \quad (\text{by Lemma 3.2})$$

where the first inequality holds since $(k^2-1)/k^2 \le 1$ and $\bar{G}_k \ge 0$. □

The proof of Theorem 3.1, the goal of this paper, is as follows.

Proof of Theorem 3.1. By applying Lemma 3.3 iteratively, we have $\bar{F}_k \ge \bar{F}_{k-1} \ge \cdots \ge \bar{F}_3 \ge \bar{F}_2$. Theorem 3.1 is proved, since

$$\min_k \min_{H_k} F_k(H_k) = \min_{H_2} F_2(H_2) \iff \min_k \bar{F}_k = \bar{F}_2 \qquad \square$$
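The constructive step of Lemma 3.2, namely that merging some pair of clusters never worsens the criterion $G$, can be checked numerically. The averaging argument in the proof never uses the optimality of the starting partition, so the assertion below also holds for an arbitrary $(k+1)$-way partition (this is exactly the generalization used as Lemma 3.3' in the next section). The sketch (ours; it reuses `ncut_criterion` from the earlier sketch) scans all pairs $(a, b)$, evaluates each merged partition via Eq. (12), and confirms the inequality:

```python
import numpy as np
# Reuses ncut_criterion() from the earlier sketch.

def G(W, labels):
    """G_k of Eq. (12): (k / (k - 1)) * F_k."""
    k = len(np.unique(labels))
    return k / (k - 1) * ncut_criterion(W, labels)

def best_merge(W, labels):
    """The step of Lemma 3.2: try every pair merge, keep the best G'_k."""
    clusters = np.unique(labels)
    best_G, best_labels = np.inf, None
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            merged = np.where(labels == b, a, labels)  # merge V_b into V_a
            g = G(W, merged)
            if g < best_G:
                best_G, best_labels = g, merged
    return best_G, best_labels

W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
labels3 = np.array([0, 0, 1, 2])            # some 3-way partition
g_merged, merged = best_merge(W, labels3)
assert g_merged <= G(W, labels3) + 1e-12    # some merge never worsens G
print(G(W, labels3), g_merged, merged)
```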
Hence, the criterion $F_k$, which is widely used for Normalized Cuts, becomes optimal (w.r.t. $k$) when $k = 2$. This fact shows that it is not appropriate to use the criterion $F_k$ for deciding $k$, that is, the number of clusters to partition.

4. Inappropriateness of a comparison of approximate solutions

In this section, we briefly show that the criterion $F_k$ is also not appropriate for comparing approximate solutions that are obtained individually.
For any given $k \ge 2$, the problem of optimizing the $k$-way partition is known to be NP-hard (Meilă and Xu, 2003). Therefore, $k$ is in practice decided by comparing approximate solutions. Suppose that we are given approximate solutions $\hat{H}_k$ ($2 \le k \le K$). Let $\hat{F}_k \stackrel{\mathrm{def}}{=} F_k(\hat{H}_k)$. Note that $\hat{F}_k \ge \bar{F}_k$. Let $\hat{k} \stackrel{\mathrm{def}}{=} \arg\min_k \hat{F}_k$; that is, the $\hat{k}$-way partition $\hat{H}_{\hat{k}}$ is the best solution among the given $K - 1$ approximate solutions.

Theorem 4.1. A bipartition $H'_2$ which satisfies $\hat{F}_{\hat{k}} \ge F_2(H'_2)$ can be constructed from $\hat{H}_{\hat{k}}$ within time complexity at most $O(\hat{k}^3)$.

We only give an outline of the proof of Theorem 4.1, since it is almost the same as that of Lemma 3.3. In the following discussion, the term $F_i(H'_i)$ is abbreviated as $F'_i$.

We assumed that an "optimal" $(k+1)$-way partition is given when proving Lemma 3.3 (and Lemma 3.2). Actually, the same argument goes through even when an "arbitrary" $(k+1)$-way partition is given, which leads to a generalized version of Lemma 3.3, denoted Lemma 3.3'. Lemma 3.3' can be applied to the best-known approximate solution $\hat{H}_{\hat{k}}$, which yields a $(\hat{k}-1)$-way partition $H'_{\hat{k}-1}$ that surpasses the best solution, because $\hat{F}_{\hat{k}} \ge F'_{\hat{k}-1}$. By applying Lemma 3.3' iteratively, we can finally construct a bipartition $H'_2$ which surpasses the best solution. Note that $\hat{F}_{\hat{k}} \ge F'_{\hat{k}-1} \ge \cdots \ge F'_2$. $H'_2$ is constructed by applying Lemma 3.3' $\hat{k} - 2$ times. The time complexity of applying Lemma 3.3' once is at most $O(\hat{k}^2)$, since the number of combinations for selecting two clusters $V_a$ and $V_b$ from the $\hat{k}$ clusters is $O(\hat{k}^2)$. Therefore, $H'_2$ can be obtained within time complexity at most $O(\hat{k}^3)$.

As an example, consider the case of Fig. 2. The best-known approximate solution in Fig. 2 is $\hat{H}_4$, since $\hat{k} = 4$. By merging an appropriate pair of clusters composing $\hat{H}_4$, we can construct a 3-way partition $H'_3$ which surpasses $\hat{H}_4$. $H'_3$ is denoted by a white circle in Fig. 3. In the same way, we can construct a bipartition $H'_2$ which surpasses $H'_3$, by merging an appropriate pair of clusters composing $H'_3$.
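The $O(\hat{k}^3)$ procedure behind Theorem 4.1 is simply repeated pair-merging. A sketch follows (ours; it assumes the `ncut_criterion` and `best_merge` helpers defined earlier): starting from any $\hat{k}$-way approximate solution, it applies the merge step $\hat{k} - 2$ times, each sweep trying $O(\hat{k}^2)$ pairs, and ends with a bipartition whose evaluation is no worse than $\hat{F}_{\hat{k}}$:

```python
import numpy as np
# Reuses ncut_criterion() and best_merge() from the earlier sketches.

def merge_down_to_bipartition(W, labels):
    """Theorem 4.1 sketch: repeatedly merge the pair that minimizes G.

    By the Lemma 3.3' argument each merge never increases F, and the
    total work is O(k-hat^3) pair evaluations."""
    labels = labels.copy()
    while len(np.unique(labels)) > 2:
        _, labels = best_merge(W, labels)
    return labels

W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
approx4 = np.array([0, 1, 2, 3])  # a given 4-way approximate solution
bipart = merge_down_to_bipartition(W, approx4)
assert ncut_criterion(W, bipart) <= ncut_criterion(W, approx4) + 1e-12
print(bipart)  # a bipartition at least as good as the 4-way solution
```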
Fig. 2. The best of the given approximate solutions $\hat{H}_k$ ($2 \le k \le K$) is $\hat{H}_4$ (i.e., $\hat{k} = 4$). The $H^*_k$ are the optimal solutions, which are unknown.

Fig. 3. A 3-way partition $H'_3$ is constructed by merging an appropriate pair of clusters composing $\hat{H}_4$. Note that $F'_3$ surpasses $\hat{F}_4$. In the same way, a bipartition $H'_2$ which surpasses $\hat{H}_4$ is constructed by merging an appropriate pair of clusters composing $H'_3$. In this way, a bipartition $H'_2$ can be constructed which surpasses the best-known approximate solution $\hat{H}_4$.
In this way, a bipartition $H'_2$ can be obtained which surpasses (or is equal to) the best-known approximate solution $\hat{H}_{\hat{k}}$. Therefore, $F_k$ is not appropriate for deciding $k$ by comparing approximate solutions.

5. Discussions

In this section, we discuss why it has not been noticed so far that the optimal $k$-way partition $H^*_k$ surpasses $H^*_{k+1}$, nor that $\bar{F}_k$ is a monotonically increasing function of $k$. We believe that there are two major reasons.

(1) First of all, the optimal $k$-way solution $H^*_k$ is difficult to obtain, since finding it is an NP-hard problem. Because of this difficulty, in practice we can usually obtain only an approximate solution $\hat{H}_k$. Indeed, Normalized Cuts is one of the methods for obtaining approximate solutions instead of optimal solutions.

(2) Besides, in general, $\hat{H}_k$ is not the partition obtained by merging two clusters of $\hat{H}_{k+1}$, as long as $\hat{H}_k$ and $\hat{H}_{k+1}$ are obtained individually by Normalized Cuts. Normalized Cuts uses spectral decomposition in order to obtain approximate solutions. However, only $k$ eigenvectors (including a trivial eigenvector) are used, leaving all the other eigenvectors unused. In short, the eigenvectors used for obtaining $\hat{H}_k$ and $\hat{H}_{k+1}$ are different. Therefore, generally, $\hat{H}_k$ is not the partition obtained by merging two clusters constructing $\hat{H}_{k+1}$, as illustrated by the sketch below.

Because of these two reasons, it has not been noticed so far that $H^*_k$ surpasses $H^*_{k+1}$, nor that $\bar{F}_k$ is a monotonically increasing function of $k$.
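For concreteness, here is a minimal sketch of the spectral relaxation step that point (2) above refers to. It follows the standard recipe (cluster the rows of the $k$ leading eigenvectors of the normalized affinity matrix); it is not code from this paper, and the hand-rolled k-means is only illustrative:

```python
import numpy as np

def spectral_partition(W, k, seed=0):
    """Standard spectral relaxation: embed with the k leading eigenvectors
    of D^{-1/2} W D^{-1/2} (the first is trivial), then cluster the rows.

    A different k uses a different set of eigenvectors, which is why the
    solutions for k and k + 1 are generally not nested (point (2) above)."""
    d = W.sum(axis=1)
    Dis = np.diag(1.0 / np.sqrt(d))
    vals, vecs = np.linalg.eigh(Dis @ W @ Dis)
    U = vecs[:, -k:]                                   # k leading eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize
    rng = np.random.default_rng(seed)                  # tiny Lloyd's k-means
    centers = U[rng.choice(len(U), k, replace=False)]
    for _ in range(50):
        labels = np.argmin(((U[:, None] - centers) ** 2).sum(-1), axis=1)
        for a in range(k):
            if (labels == a).any():
                centers[a] = U[labels == a].mean(axis=0)
    return labels

# labels = spectral_partition(W, k=2)  # W as in the earlier sketches
```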
6. Related works

In terms of criteria in the field of spectral clustering, Ratio-Cut (Hagen and Kahng, 1992) and Min–Max Cut (Ding et al., 2001) are well known. The difference between them and Normalized Cuts lies in the definition of degree (i.e., Eq. (8)). $\mathrm{degree}(V_a)$ for Ratio-Cut is defined as the number of data included in the cluster $V_a$. $\mathrm{degree}(V_a)$ for Min–Max Cut is defined as $\mathrm{links}(V_a, V_a)$, which shows how strongly the data in $V_a$ are connected with each other. Among them, Normalized Cuts is the most practically used criterion in the field of spectral clustering.

Deciding the number of clusters (or model order) is a part of cluster validation. There is an approach called resampling (Levine and Domany, 2001; Tibshirani et al., 2001), whose basic idea is to find a partition with high stability against resampling of the dataset. Resampling means selecting a subset of the given dataset. The stability of a certain partition is empirically estimated by the average probability that the cluster memberships do not change over repeated cluster analyses of each resampled dataset. An advantage of resampling is that a partition at a stable local optimum can be obtained, while a disadvantage is that it is time-consuming because of the repeated cluster analyses.

In terms of methods for deciding the number of clusters, there are still more approaches. We briefly introduce three of them. The first approach is based on log-likelihood with a penalty, such as the Akaike Information Criterion (AIC) (Akaike, 1974) and the Bayesian Information Criterion (BIC) (Schwarz, 1978). Both of them are adopted by SPSS (Norusis, 2005), one of the most commonly used statistical tools. The second approach is based on coding length, such as the Minimum Description Length (MDL) principle (Kontkanen et al., 2003). The last one is the full Bayesian approach, whose representative implementation is AutoClass (Cheeseman and Stutz, 1996), which performs soft clustering.

7. Conclusion

Normalized Cuts is a representative algorithm of spectral clustering, where $F_k$ is widely used as a criterion for deciding the number $k$ of clusters to partition. However, it is not appropriate to use $F_k$ for deciding $k$. We showed this by proving that the optimal bipartition (when $k = 2$) becomes the optimal (w.r.t. $k$) clustering based on $F_k$. Moreover, we proved that $\bar{F}_k$, i.e., the evaluation of the optimal $k$-way partition, is a monotonically increasing function of $k$. Therefore, we cannot use the criterion $F_k$ in order to decide $k$.
It is also inappropriate to decide $k$ by comparing approximate solutions with various $k$, because we can construct a bipartition which surpasses the best given approximate solution within time complexity at most $O(\hat{k}^3)$, where $\hat{k}$ is the number of clusters contained in $\hat{H}_{\hat{k}}$.

Therefore, the criterion $F_k$ is not appropriate for deciding $k$. An alternative criterion is necessary.

References

Akaike, H., 1974. A new look at the statistical identification model. IEEE Trans. Automat. Control 19, 716–723.

Cheeseman, P., Stutz, J., 1996. Bayesian classification (AutoClass): theory and results. In: Fayyad, U., Shapiro, G.P., Smyth, P., Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data Mining. AAAI Press, pp. 153–180.

Cour, T., Bénézit, F., Shi, J., 2005. Spectral segmentation with multiscale graph decomposition. In: Conf. on Computer Vision and Pattern Recognition, pp. 1124–1131.

Dhillon, I.S., 2001. Co-clustering documents and words using bipartite spectral graph partitioning. In: Internat. Conf. on Knowledge Discovery and Data Mining, pp. 269–274.

Dhillon, I.S., Guan, Y., Kulis, B., 2004. Kernel k-means, spectral clustering and normalized cuts. In: Internat. Conf. on Knowledge Discovery and Data Mining, Poster Session: Research track posters, pp. 551–556.

Ding, C., He, X., 2004. Linearized cluster assignment via spectral ordering. In: Proc. 21st Internat. Conf. on Machine Learning.

Ding, C.H.Q., He, X., Zha, H., Gu, M., Simon, H.D., 2001. A min–max cut algorithm for graph partitioning and data clustering. In: Proc. IEEE Internat. Conf. on Data Mining, pp. 107–114.

Hagen, L., Kahng, A., 1992. New spectral methods for ratio-cut partitioning and clustering. IEEE Trans. Comput.-Aided Des. 11 (9), 1074–1085.

Kannan, R., Vempala, S., Vetta, A., 2004. On clusterings: Good, bad and spectral. J. ACM 51 (3), 497–515.

Kontkanen, P., Myllymäki, P., Buntine, W., Rissanen, J., Tirri, H., 2003. An MDL framework for data clustering. HIIT Technical Report 2004-6.

Levine, E., Domany, E., 2001. Resampling method for unsupervised estimation of cluster validity. Neural Comput. 13 (11), 2573–2593.

Malik, J., Belongie, S., Leung, T.K., Shi, J., 2001. Contour and texture analysis for image segmentation. Internat. J. Comput. Vision 43 (1), 7–27.

Meilă, M., Xu, L., 2003. Multiway cuts and spectral clustering. In: Advances in Neural Information Processing Systems.

Norusis, M., 2005. SPSS 13.0 Statistical Procedures Companion. Prentice-Hall.

Pentney, W., Meilă, M., 2005. Spectral clustering of biological sequence data. In: American Association for Artificial Intelligence, pp. 845–850.

Schwarz, G., 1978. Estimating a dimension of a model. Ann. Statist. 6, 461–464.

Shi, J., Malik, J., 2000. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Machine Intell. 22 (8), 888–905.

Tibshirani, R., Walther, G., Botstein, D., Brown, P., 2001. Cluster validation by prediction strength. Technical Report, Department of Biostatistics, Stanford University.

Yu, S., Shi, J., 2003. Multiclass spectral clustering. In: Internat. Conf. on Computer Vision.

Yu, S., Shi, J., 2004. Segmentation given partial grouping constraints. IEEE Trans. Pattern Anal. Mach. Intell. 26 (2), 173–183.