Information Processing and Management 49 (2013) 871–883
Learning a subspace for clustering via pattern shrinking

Chenping Hou a,*, Feiping Nie b, Yuanyuan Jiao a, Changshui Zhang b, Yi Wu a

a Department of Mathematics and Systems Science, National University of Defense Technology, Changsha 410073, China
b Department of Automation, Tsinghua University, Beijing 100084, China
Article info

Article history: Received 4 April 2011; Received in revised form 8 January 2013; Accepted 10 January 2013; Available online 5 March 2013.

Keywords: Clustering; Subspace learning; Pattern shrinking
Abstract

Clustering is a basic technique in information processing. Traditional clustering methods, however, are not suitable for high dimensional data, so learning a subspace for clustering has emerged as an important research direction. Nevertheless, meaningful data often lie on a low dimensional manifold, and existing subspace learning approaches cannot fully capture the nonlinear structure of this hidden manifold. In this paper, we propose a novel subspace learning method that not only characterizes the linear and nonlinear structures of the data, but also reflects the requirements of the subsequent clustering. Compared with other related approaches, the proposed method derives a subspace that is more suitable for clustering high dimensional data. Promising experimental results on different kinds of data sets demonstrate the effectiveness of the proposed approach.

© 2013 Elsevier Ltd. All rights reserved.
1. Introduction

The task of clustering is to group similar data into the same cluster and divide dissimilar data into different clusters. It is an important research direction in many fields, such as statistics, pattern recognition, and information processing. The clustering problem has been addressed by researchers in many disciplines (Jain, Murty, & Flynn, 1999; Gondek & Hofmann, 2007), which reflects its broad appeal and usefulness in exploratory data mining. It is regarded as a basic preprocessing technique and has been widely used in many real applications. For example, in web search, if we can cluster different kinds of web texts effectively, it becomes convenient to discover the latent similarities among different web pages, and the efficiency of searching for a web page of interest can also be improved considerably.

Due to the importance of clustering, a large body of research has emerged to address this problem. Existing clustering approaches can be broadly classified into four types (Shi, Song, & Zhang, 2005): partitioning, hierarchical, grid-based and density-based algorithms (Guha, Rastogi, & Shim, 2000; Agrawal, Gehrke, Gunopulos, & Raghavan, 1998; Ankerst, Breunig, Kriegel, & Sander, 1999; Kaufman & Rousseeuw, 1990). These approaches have been widely used and have achieved promising results when the dimensionality of the samples is low. In many real applications, however, e.g., image clustering and web mining, the dimensionality of the data is very high. For example, an image of $d_1 \times d_2$ pixels is represented as a $(d_1 \times d_2)$-dimensional vector (Belhumeur, Hespanha, & Kriegman, 1996), and a web page is often represented by a high dimensional vector as well (Yu, Bi, & Tresp, 2006). When clustering high dimensional data, previous research has shown that the performance of traditional clustering methods degrades as the dimensionality increases (Ding & Li, 2007; Ye, Zhao, & Wu, 2007).

To cluster high dimensional data, many methods have been proposed that first reduce the dimensionality of the original data and then cluster the computed low dimensional embeddings. This is mainly due to the rapid growth of research on manifold learning, whose aim is to learn a low dimensional representation for high dimensional data.

* Corresponding author. Tel.: +86 731 8457 3238. E-mail address: [email protected] (C. Hou).
Fig. 1. Toy examples of the pattern shrinking results Y on the two-circle data. (a) Original data; (b) shrunk patterns with α = 0.01; (c) shrunk patterns with α = 0.003; and (d) shrunk patterns with α = 0.0001; see more details in Section 3.
Among these dimensionality reduction approaches, subspace learning, i.e., reducing the dimensionality linearly, has been widely investigated due to its simplicity, effectiveness and inductive nature (Ding & Li, 2007; Ye et al., 2007). We therefore also focus on subspace learning methods.

Among the approaches that aim to learn a subspace for clustering, the most common one is to employ traditional Principal component analysis (PCA) (Duda, Hart, & Stork, 2000) to reduce the dimensionality of the original data first and then to apply Kmeans (Xu & Wunsch, 2005) for clustering. For brevity, we name it PcaKm. Recently, Ding et al. extended the traditional supervised Linear discriminant analysis (LDA) method (Duda et al., 2000) to clustering; the resulting algorithm is named LdaKm (Ding & Li, 2007). Almost at the same time, Ye et al. proposed a related approach named DisKmeans (Ye et al., 2007). Essentially, both LdaKm and DisKmeans compute the clustering results and learn the subspace alternately. Some other subspace learning approaches, e.g., Locality preserving projections (LPP), have also been used before clustering (He & Niyogi, 2003). Besides, Li et al. proposed an adaptive subspace iteration algorithm for document clustering (Li, Ma, & Ogihara, 2004), and Gu et al. proposed a subspace clustering method based on the maximum margin criterion (SMMC) (Gu & Zhou, 2009). These approaches perform well in many applications. However, their performance can still be improved, since (1) some of them, such as PcaKm, separate the procedures of dimensionality reduction and clustering; in other words, the dimensionality reduction step is not particularly designed for clustering; and (2) some of these subspace learning approaches, such as LdaKm, the adaptive approach and SMMC, use a linear projection for dimensionality reduction and do not fully consider the nonlinear structure of the data points.

In this paper, we aim to find a linear subspace that incorporates both the linear and nonlinear structures of the original data and is particularly designed for clustering. Inspired by the framework of semi-supervised learning (Zhou, Bousquet, Lal, Weston, & Schölkopf, 2003), we propose a new strategy that shrinks the patterns and simultaneously learns the subspace. Different from previous subspace learning methods that take the original data as the input, we replace the original data by their shrunk patterns. This replacement not only integrates both the linear and nonlinear manifold structures, but also considers the requirements of the subsequent clustering procedure. In the following, we name the proposed approach Subspace Learning via Pattern Shrinking (SLPS). Taking Fig. 1 as an intuitive example, we shrink the original data (Fig. 1a) to the patterns shown in Fig. 1d, which makes it much easier to find a suitable subspace for clustering. More interestingly, by changing one parameter of our algorithm, i.e., α, the small circle moves out of the big circle step by step. See more details in Section 3.

This paper is structured as follows. Section 2 presents a brief introduction to related work. We describe the proposed subspace learning via pattern shrinking model in Section 3. Section 4 shows experimental results, followed by the conclusions in Section 5.

2. Related work

In this section, we review related work. The first part covers two typical subspace learning approaches, and the second is mainly about a framework of semi-supervised learning.

2.1. PcaKm and LdaKm
Assume that each sample is represented by an r-dimensional vector and there are n samples in total for clustering, i.e., $\{x_1, x_2, \ldots, x_n\}$, where $x_i \in \mathbb{R}^r$ for $i = 1, 2, \ldots, n$. In subspace clustering, we aim to compute a transformation matrix $P \in \mathbb{R}^{r \times d}$, where d is the dimensionality of the subspace. Each sample is then projected into a low dimensional subspace by $z_i = P^T x_i$, where $z_i$ is the embedding of $x_i$. Denote $X = [x_1, x_2, \ldots, x_n]$ and $Z = [z_1, z_2, \ldots, z_n]$; therefore, $Z = P^T X$. After projecting the data, we can apply other clustering techniques, such as Kmeans, in the subspace.

We now introduce two typical subspace learning approaches for clustering, i.e., PcaKm and LdaKm (Ding & Li, 2007). Assume $F \in \mathbb{R}^{n \times c}$ is the cluster indicator matrix, where $F_{ij} = 1/\sqrt{n_j}$ if and only if $x_i$ belongs to the jth cluster and $F_{ij} = 0$ otherwise. Here c is the number of clusters and $n_j$ is the number of samples in the jth cluster.
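As a quick illustration of this notation, the following is a minimal sketch (in Python with NumPy; the function name is ours, not from the paper) that builds the scaled indicator matrix F from a label vector and checks that its columns are orthonormal:

```python
import numpy as np

def cluster_indicator(labels, c):
    """Scaled cluster indicator matrix F (n x c):
    F[i, j] = 1/sqrt(n_j) if sample i is in cluster j, and 0 otherwise."""
    n = len(labels)
    F = np.zeros((n, c))
    for j in range(c):
        members = np.flatnonzero(labels == j)
        if members.size > 0:
            F[members, j] = 1.0 / np.sqrt(members.size)
    return F

# Toy check: the columns of F are orthonormal, so F^T F = I
labels = np.array([0, 0, 1, 1, 1, 0])
F = cluster_indicator(labels, c=2)
assert np.allclose(F.T @ F, np.eye(2))
```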
PcaKm first employs the PCA technique to reduce the dimensionality; in other words, it seeks the projection matrix P that solves the following problem.
$$\arg\max_{P} \operatorname{tr}\left(P^T X \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) X^T P\right) \quad \text{s.t. } P^T P = I \tag{1}$$
where $\mathbf{1}$ is an n-dimensional column vector of all ones. Then, PcaKm employs the Kmeans approach to compute the clusters, i.e., it aims to minimize the following objective.
$$\arg\min_{C_1, C_2, \ldots, C_c} \sum_{j=1}^{c} \sum_{z_i \in C_j} (z_i - m_j)^T (z_i - m_j) \tag{2}$$
where $C_j$ is the set formed by the data in the jth cluster and $m_j$ is their mean. Recalling the definition of F and the relation $Z = P^T X$, we can rewrite the Kmeans objective in Eq. (2) as follows.
$$\arg\min_{F} \operatorname{tr}\left(P^T X (I - FF^T) X^T P\right) \tag{3}$$
In summary, PcaKm first solves the problem in Eq. (1) to compute the optimal projection matrix P and then obtains the clustering results by solving the problem in Eq. (3). Different from PcaKm, LdaKm (Ding & Li, 2007; Ye et al., 2007) first employs the Kmeans approach to cluster the original data and then uses LDA to compute the projection matrix. After projecting each sample, it employs Kmeans again to cluster the samples in the low dimensional space and then applies LDA once more. After repeating this several times, we obtain the clusters and the projection matrix P simultaneously. More concretely, its objective function can be formulated in the following form.
$$\arg\min_{F,P} \operatorname{tr}\left[\left(P^T X (I - FF^T) X^T P\right)\left(P^T X \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T\right) X^T P\right)^{-1}\right] \quad \text{s.t. } P^T P = I \tag{4}$$
To solve this problem, LdaKm uses an alternating optimization strategy: it fixes one variable and optimizes the other. Interestingly, this optimization procedure is equivalent to applying LDA and Kmeans alternately. One point should be mentioned here: in traditional LDA, the dimensionality of the subspace is at most c − 1, where c is the number of categories (Duda et al., 2000). In order to learn a subspace of higher dimensionality, we must add a regularizer. The regularized LdaKm algorithm has the following formulation.
$$\arg\min_{F,P} \operatorname{tr}\left[\left(P^T X (I - FF^T) X^T P\right)\left(P^T \left(X \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T\right) X^T + \lambda I\right) P\right)^{-1}\right] \quad \text{s.t. } P^T P = I$$

where λ is a regularization parameter.

2.2. Semi-supervised learning

Since our method is inspired by a framework shared by several semi-supervised learning methods, we provide a brief introduction to this framework. As shown in Zhou et al. (2003), some semi-supervised learning approaches employ a nonnegative weight $W_{ij}$ to measure the similarity between $x_i$ and $x_j$. The goal of these approaches can be regarded as estimating a classification function f under the following constraints: (1) the computed labels on labeled data should be consistent with the given labels and (2) f should be smooth with respect to the data manifold. This results in the following objective function.
$$\mathcal{L}(f) = \alpha G(f \mid X^{(T)}) + \beta S(f) \tag{5}$$

Here G is a loss function measuring the consistency between the computed labels and the given labels, and $X^{(T)}$ denotes the labeled data. $S(f)$ is the smoothness term, which penalizes non-smooth f and usually has the form

$$S(f) = f^T S f \tag{6}$$

where $f = [f(x_1), f(x_2), \ldots, f(x_n)]^T$ and S is an $n \times n$ smoothness matrix, which has been used in Belkin et al.'s manifold regularization framework (Belkin, Niyogi, & Sindhwani, 2006). As we will show later, the formulation of pattern shrinking is similar to the function in Eq. (5). However, they are different in essence: since there is no labeled data in clustering, pattern shrinking aims to shrink the original patterns, not to predict labels.
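For reference, when the smoothness matrix S is taken to be the graph Laplacian L = D − W of the similarity graph (the usual choice in the manifold regularization framework cited above; the paper itself leaves S generic, so this is an assumption), the smoothness term expands as

$$S(f) = f^T L f = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}\,\bigl(f(x_i) - f(x_j)\bigr)^2,$$

which penalizes large differences of f between strongly connected points and is the pairwise form that the pattern shrinking term in Section 3 will mirror.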
3. Subspace learning via pattern shrinking

In this section, we introduce some notation and formulate the subspace learning model based on our pattern shrinking technique. After that, we show how to derive an approximate solution efficiently. Finally, some preliminary discussions are provided.

3.1. Problem formulation

Let us introduce some notation. Assume the shrunk patterns of the samples are $\{y_1, y_2, \ldots, y_n\}$, where $y_i \in \mathbb{R}^r$ for $i = 1, 2, \ldots, n$. In our setting, we plan to compute a projection matrix P. Each sample is projected into a low dimensional subspace by $z_i = P^T y_i$, where $z_i$ is the embedding of $y_i$ (correspondingly, $x_i$) and $P \in \mathbb{R}^{r \times d}$. Denote $Y = [y_1, y_2, \ldots, y_n]$ and $Z = [z_1, z_2, \ldots, z_n]$; therefore, $Z = P^T Y$. The notation is summarized in Table 1.

We now present our algorithm step by step. Motivated by the semi-supervised learning framework in Eq. (5), we optimize the following three objectives.

(1) Recall the basic assumption of clustering, i.e., nearby points are more likely to belong to the same cluster. We therefore design a weight matrix whose elements measure the similarity of any two points, and the shrunk patterns should maintain these similarities. Intuitively, if two points are nearby, they should belong to the same cluster and the corresponding weight should be large; if two points are far away, the corresponding weight should be small. This objective can be implemented via the second term in Eq. (5). Nevertheless, different from the label propagation strategy that propagates labels along the manifold in semi-supervised learning, we shrink nearby samples to be as close as possible.

(2) The shrunk patterns should be consistent with the original data; more concretely, a shrunk pattern should not be far from its original data point. Compared with the first objective, which uses local similarity for shrinking, this objective can be regarded as keeping dissimilarity, since we require the shrunk pattern to stay close to the original data.

(3) After learning the shrunk patterns, we learn a subspace in which the projections of the shrunk patterns maintain their original separation. This guarantees that as little information as possible is lost when deriving the shrunk data.

We now explain the details of each step. The first step of our approach is to characterize the manifold structure of the original data points $\{x_1, x_2, \ldots, x_n\}$. We construct a k-nearest neighbor graph whose nodes are the points and whose edges are obtained by connecting every point to its k nearest neighbors. Denote by $\mathcal{N}(i)$ the index set of the k nearest neighbors of $x_i$. The weight matrix W associated with this k-nearest neighbor graph is computed by the following equation.
$$W_{ij} = W_{ji} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right) & x_j \in \mathcal{N}(i) \text{ or } x_i \in \mathcal{N}(j) \\ 0 & \text{otherwise} \end{cases} \tag{7}$$
where σ is the width parameter of the Gaussian function. Obviously, the weight matrix W satisfies the requirement of our first objective. There are several other ways to define the weight matrix (Hofmann & Buhmann, 1998; Robles-Kelly & Hancock, 2004); we choose the above strategy since it is simple and widely used (Belkin et al., 2006). Considering the definition of W, we keep the local similarity by minimizing the following objective.

$$\arg\min_{Y} \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij} \|y_i - y_j\|^2 \tag{8}$$
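A minimal sketch of this graph construction (Python/NumPy with SciPy; it uses a single global σ rather than the local scaling adopted later in Section 3.3, and the function name is ours):

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_gaussian_graph(X, k=7, sigma=1.0):
    """Symmetric k-NN weight matrix W of Eq. (7) and the graph Laplacian
    L = D - W used in Section 3.2. X is (n, r) with rows as samples."""
    n = X.shape[0]
    D2 = cdist(X, X, metric="sqeuclidean")       # pairwise squared distances
    nn = np.argsort(D2, axis=1)[:, 1:k + 1]      # k nearest neighbors, self excluded
    W = np.zeros((n, n))
    for i in range(n):
        for j in nn[i]:
            w = np.exp(-D2[i, j] / (2.0 * sigma ** 2))
            W[i, j] = W[j, i] = w                # x_j in N(i) or x_i in N(j)
    L = np.diag(W.sum(axis=1)) - W               # Laplacian L = D - W
    return W, L
```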
For the second objective, we aim to keep the consistency between the shrunk patterns and the original data. Intuitively, a shrunk pattern should not be far away from its original data point.
Table 1
Notations.

r: Dimensionality of the original space
d: Dimensionality of the embedding space
$x_i \in \mathbb{R}^{r\times 1}$: The original high dimensional data
$y_i \in \mathbb{R}^{r\times 1}$: The shrunk pattern of $x_i$
$z_i \in \mathbb{R}^{d\times 1}$: The projection of $y_i$ (or $x_i$), i.e., $z_i = P^T y_i$
$P \in \mathbb{R}^{r\times d}$: The projection matrix
X: Data matrix of original samples, i.e., $X = [x_1, x_2, \ldots, x_n]$
Y: Data matrix of shrunk patterns, i.e., $Y = [y_1, y_2, \ldots, y_n]$
Z: Data matrix of projections, i.e., $Z = [z_1, z_2, \ldots, z_n]$
Moreover, minimizing Eq. (8) alone would shrink all points together, so we should also keep the data dissimilarity, which again requires that the shrunk patterns are consistent with the original data. We therefore minimize the following loss function directly.

$$\arg\min_{Y} \sum_{i=1}^{n} \|x_i - y_i\|^2 \tag{9}$$
For the third objective, we compute the direction in which the variance of projection is as large as possible. In other words, we plan to optimize the following objective.
$$\arg\max_{Z} \sum_{i=1}^{n} \left\| z_i - \frac{1}{n}\sum_{j=1}^{n} z_j \right\|^2 \tag{10}$$
As in most cases, we constrain the transformation matrix to be orthogonal, i.e., $P^T P = I$, and add the parameters α and β to balance the effects of the three terms. In other words, we formulate SLPS as follows.
$$\arg\min_{P^T P = I,\; Y,\; Z = P^T Y} \;(1-\alpha) \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij} \|y_i - y_j\|^2 + \alpha \sum_{i=1}^{n} \|x_i - y_i\|^2 - \beta \sum_{i=1}^{n} \left\| z_i - \frac{1}{n}\sum_{j=1}^{n} z_j \right\|^2 \tag{11}$$
Here, α and β are two balance parameters and I is the identity matrix. We can see that the first two terms aim to derive the shrunk patterns, while the last term projects the shrunk patterns into a low dimensional space. Equivalently, the problem in Eq. (11) can be formulated, with respect to P and Y, as follows.
$$\arg\min_{P^T P = I,\; Y} \;(1-\alpha) \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij} \|y_i - y_j\|^2 + \alpha \sum_{i=1}^{n} \|x_i - y_i\|^2 - \beta \sum_{i=1}^{n} \left\| P^T y_i - \frac{1}{n}\sum_{j=1}^{n} P^T y_j \right\|^2 \tag{12}$$
We now use the toy example in Fig. 1 to explain why the patterns shrink along the manifold and why SLPS can derive more useful low dimensional representations by minimizing Eq. (11). Intuitively, by optimizing Eq. (11) we characterize the nonlinear structure of the data and take the requirements of the subsequent clustering into account through the shrunk patterns, i.e., by minimizing the first two terms; the subspace is then derived through the joint effect of all three terms.

(1) Minimizing the first term in Eq. (11) tends to compress the representations of nearby samples to be as close as possible. More concretely, if two points $x_i$ and $x_j$ are near each other (commonly, they are in the same cluster), the corresponding weight $W_{ij}$ (or $W_{ji}$) is large and the minimization of the first term tends to derive similar $y_i$ and $y_j$. On the contrary, if $x_i$ and $x_j$ are not connected, $W_{ij} = W_{ji} = 0$ and the first term does not involve them.

(2) The second term aims to keep the consistency between the shrunk patterns and the original data. The third term projects the data onto the direction of maximum variance.

(3) Considering the effects of the first two terms in Eq. (11), the shrunk patterns are computed by optimizing them together. This not only reflects the requirement of characterizing the nonlinear structure, but also considers the requirement of learning a subspace for clustering. More concretely, the first term aims to shrink nearby points toward the same pattern. Nevertheless, if we only minimized the first term, all points would concentrate at a single center. By adding the constraint of the second term, samples belonging to different clusters shrink toward their own clusters. Taking Fig. 1 as an example, the first objective shrinks nearby data toward the two centers, while the second objective requires the shrunk patterns to be consistent with the original data; their joint effect guarantees that the patterns shrink toward their own centers.

(4) We employ the parameters α and β to balance the effects of the three terms. As mentioned above, by tuning α we balance the effect of concentrating samples within the same cluster against keeping separation between points belonging to different clusters. By adjusting β, we balance the effect of computing the shrunk patterns against deriving the low dimensional embedding.

(5) As seen from Fig. 1, when β = 0, the smaller α is, the more the first term dominates, resulting in stronger shrinking. Certainly, this does not mean that a smaller α is always better, since we should also pay attention to the second term. Moreover, as seen from Fig. 1d, once the points are well clustered, the maximum variance direction for projection (the third objective) is horizontal, and SLPS finds this direction, along which the data can be projected and then clustered.

3.2. Solution

In this subsection, we solve the formulated problem. For convenience, denote L = D − W, where D is a diagonal matrix with entries $D_{ii} = \sum_j W_{ij}$. In fact, L is the Laplacian matrix of the k-nearest neighbor graph with weight matrix W defined above (Belkin et al., 2006). The objective function in Eq. (12), denoted by $\mathcal{L}(P, Y)$, becomes
$$\mathcal{L}(P, Y) = (1-\alpha)\operatorname{tr}(YLY^T) + \alpha\operatorname{tr}\left[(X-Y)^T(X-Y)\right] - \beta\operatorname{tr}(P^T Y H Y^T P)$$
$$\qquad\;\;\; = (1-\alpha)\operatorname{tr}(YLY^T) + \alpha\operatorname{tr}\left[(X-Y)^T(X-Y)\right] - \beta\operatorname{tr}(Y^T P P^T Y H) \tag{13}$$
where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the centering matrix. Since this function is not jointly convex with respect to P and Y, it is difficult to find both optimal solutions simultaneously. We apply an alternating optimization strategy to find approximate solutions: we fix one variable and optimize the other, alternately.

3.2.1. Fixing P and optimizing Y

We first fix P and set the derivative of $\mathcal{L}(P, Y)$ with respect to Y to zero,
$$\frac{\partial \mathcal{L}}{\partial Y} = 2(1-\alpha)YL + 2\alpha(Y - X) - 2\beta MYH = 0 \tag{14}$$
where $M = PP^T$. This problem cannot be solved directly by the spectral decomposition technique used in other subspace learning methods, for example, PCA, LDA and LPP. Nevertheless, it is a linear problem with respect to all the elements of Y, i.e., $Y_{ij}$. More concretely, we expand X and Y by concatenating their rows and define the expanded vectors as follows.
$$\hat{x} = [X_{11}, X_{12}, \ldots, X_{1r}, X_{21}, \ldots, X_{2r}, \ldots, X_{n1}, \ldots, X_{nr}]$$
$$\hat{y} = [Y_{11}, Y_{12}, \ldots, Y_{1r}, Y_{21}, \ldots, Y_{2r}, \ldots, Y_{n1}, \ldots, Y_{nr}]$$

By reformulating Eq. (14) in terms of $\hat{x}$ and $\hat{y}$, we have the following equation.

$$[(1-\alpha)L \otimes I]\hat{y} + \alpha\hat{y} - \alpha\hat{x} - \beta(H \otimes M)\hat{y} = 0 \tag{15}$$
where ⊗ represents the Kronecker product. Considering the problem in Eq. (15), the optimal solution can be derived directly as follows.
$$\hat{y} = \alpha\left\{[(1-\alpha)L \otimes I] + \alpha I - \beta(H \otimes M)\right\}^{-1}\hat{x} \tag{16}$$
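For small problems, Eq. (16) can be evaluated literally with a dense Kronecker system. A minimal NumPy sketch (illustrative names, rows of X as samples; not intended for large n·r):

```python
import numpy as np

def shrink_patterns_fixed_P(X, L, P, alpha, beta):
    """Solve Eq. (16) for the shrunk patterns with P fixed.
    X: (n, r) data with rows as samples, L: (n, n) graph Laplacian,
    P: (r, d) with P^T P = I. Returns Y of shape (n, r)."""
    n, r = X.shape
    M = P @ P.T                                   # M = P P^T (r x r)
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    A = ((1 - alpha) * np.kron(L, np.eye(r))
         + alpha * np.eye(n * r)
         - beta * np.kron(H, M))                  # system matrix of Eq. (16)
    y_hat = np.linalg.solve(A, alpha * X.reshape(-1))  # x_hat: row-wise flattening of X
    return y_hat.reshape(n, r)
```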
3.2.2. Fixing Y and optimizing P

When we fix Y and optimize P, Eq. (12) becomes
$$\arg\max_{P^T P = I} \operatorname{tr}(P^T Y H Y^T P) \tag{17}$$
This is equivalent to the objective function of PCA. Thus, it can be solved effectively by the spectral decomposition of $YHY^T$.

In summary, we have proposed an approach that solves the optimization problem of SLPS in an alternating way. Compared with traditional non-iterative subspace learning methods, such as PCA, its computational cost is somewhat high, since several iterations are needed to obtain the approximate solutions. To reduce the computational cost, we now show a property of SLPS and propose one-shot SLPS, which derives an approximate solution without iteration. Assume $\widehat{Y}$ is formed by reshaping $\hat{y}$ into an $(r \times n)$ matrix, i.e., the inverse of the process that reshapes Y into $\hat{y}$; the following proposition describes the limiting behavior as $\beta \to 0$.

Proposition 1. When $\beta \to 0$, $\widehat{Y} = \alpha X[(1-\alpha)L + \alpha I]^{-1}$.

Proof. When $\beta \to 0$, $\hat{y} = \alpha\{[(1-\alpha)L \otimes I] + \alpha I\}^{-1}\hat{x}$. Equivalently, $\widehat{Y} = \alpha X[(1-\alpha)L + \alpha I]^{-1}$. □

The same result can also be obtained by solving the problem in Eq. (14) directly with β = 0. Although this proposition is simple, it allows us to solve the problem in Eq. (12) approximately in another way and to derive one-shot SLPS by solving the following optimization problems in sequence.
$$Y = \alpha X[(1-\alpha)L + \alpha I]^{-1} \tag{18}$$

$$\arg\max_{P^T P = I} \operatorname{tr}(P^T Y H Y^T P) \tag{19}$$
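A minimal sketch of this one-shot procedure (again with rows as samples; the function name is illustrative, not the authors' code):

```python
import numpy as np

def one_shot_slps(X, L, alpha, d):
    """One-shot SLPS sketch (Eqs. (18)-(19)). X: (n, r) data with rows as
    samples, L: (n, n) graph Laplacian. Returns the shrunk patterns Y,
    the projection P and the embedding Z."""
    n = X.shape[0]
    # Eq. (18); with rows as samples the closed form becomes a linear solve
    A = (1 - alpha) * L + alpha * np.eye(n)
    Y = alpha * np.linalg.solve(A, X)             # (n, r) shrunk patterns
    # Eq. (19): leading d eigenvectors of the scatter of the shrunk patterns
    H = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(Y.T @ H @ Y)
    P = vecs[:, np.argsort(vals)[::-1][:d]]       # (r, d) projection matrix
    Z = Y @ P                                     # (n, d) embedding
    return Y, P, Z
```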
Note that one-shot SLPS first computes the shrunk patterns and then applies PCA on the shrunk patterns to derive the final embedding Z. Although the computation and projection of the shrunk patterns are separated in one-shot SLPS, the method still performs well, since (1) the original data are shrunk toward their own clusters, so it is effective to compute the subspace directly with PCA, and (2) the pattern shrinking technique already captures the nonlinear structure of the original data, so it is reasonable to apply PCA to the shrunk patterns. More importantly, compared with solving the problems in Eqs. (15) and (17) alternately, one-shot SLPS reduces the computational requirement, since an approximate solution to the problem in Eq. (12) is found in a single pass, without iteration. SLPS, in turn, can be regarded as running one-shot SLPS for several iterations. Thus, in the following experiments, we always use one-shot SLPS for comparison. In summary, the basic procedure of SLPS is summarized in Table 2.
Table 2
SLPS procedure.

Input: Data matrix X; parameters d, k, σ, α and β.
Output: Embedding Z, shrunk patterns Y and projection matrix P.
(1) Compute the weight matrix W by Eq. (7);
Repeat steps (2) and (3) until convergence:
  (2) Fix P and compute the shrunk patterns $\hat{y}$ by Eq. (16);
  (3) Fix Y and compute the projection P by solving the problem in Eq. (17);
(4) Compute the embedding Z by $Z = P^T Y$.
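For completeness, here is a sketch of the iterative procedure in Table 2, reusing the hypothetical shrink_patterns_fixed_P helper sketched after Eq. (16); the stopping rule is simplified to a fixed iteration count.

```python
import numpy as np

def slps(X, L, alpha, beta, d, n_iter=10, seed=0):
    """Iterative SLPS sketch following Table 2 (rows-as-samples convention)."""
    n, r = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    rng = np.random.RandomState(seed)
    P = np.linalg.qr(rng.randn(r, d))[0]          # arbitrary orthonormal start
    for _ in range(n_iter):                       # repeat steps (2) and (3)
        Y = shrink_patterns_fixed_P(X, L, P, alpha, beta)   # step (2): Eq. (16)
        vals, vecs = np.linalg.eigh(Y.T @ H @ Y)
        P = vecs[:, np.argsort(vals)[::-1][:d]]             # step (3): Eq. (17)
    return Y, P, Y @ P                            # step (4): Z = P^T Y
```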
3.3. Performance analysis

We analyze the proposed approach from several aspects. First, we discuss parameter selection. There are four important parameters: the dimensionality of the subspace d, the balance parameters α and β, and the width parameter σ. Parameter determination is an essential task in most learning algorithms (Zelnik-Manor & Perona, 2004). The parameter σ is particularly important since it determines how well the constructed graph models the low dimensional manifold. As in many related approaches, we apply the technique in Zelnik-Manor and Perona (2004) to tune this parameter automatically. More concretely, instead of selecting a single parameter σ, a local scaling parameter σ_i is calculated for each data point x_i by simply measuring the similarity between x_i and its kth neighbor; see Zelnik-Manor and Perona (2004) for details. The parameter α also affects the performance of SLPS, especially on toy examples such as the results in Fig. 1. Among the various options, grid search is probably the simplest and most widely used strategy in unsupervised learning. Since we have constrained 0 < α < 1, grid search is applied in the following experiments: we determine α by searching the grid {0.05, 0.1, 0.15, . . . , 0.95} and choose the best value. In one-shot SLPS, since β is omitted, we do not need to determine this parameter.

The second concern is the computational complexity, which is important for real applications. Considering the procedure in the previous subsection, the shrunk patterns Y and the transformation matrix P can be computed directly by solving the problems in Eqs. (18) and (19), without iteration. The computational complexity is O((max(N, r))^3). However, this is still somewhat high for very high dimensional data sets of very large volume. We have also used the numerical methods of Vogt, Mizaikoff, and Tacke (2001), which accelerate PCA in hyperspectral image processing. Another possible way is to conduct Kmeans clustering first and then use the cluster centroids as representative data to reduce the sample size.

Finally, the pattern shrinking technique can be regarded as a preprocessing technique; other clustering methods are required to compute the final results. As shown in the toy example, since SLPS can effectively shrink the patterns, it greatly benefits the subsequent clustering. For illustration, we employ two different clustering approaches, i.e., Kmeans and Spectral Clustering (Normalized Cut (Ncut) (Shi & Malik, 2000) as an example), to cluster the samples in the subspace. For brevity, they are named PsKm and PsNcut, respectively. One point should also be highlighted here: Shi et al. have proposed another "shrinking"-based approach (Shi et al., 2005), which uses Newton's Universal Law of Gravitation to compute the similarity of any two points and then clusters the data. Different from this approach, our research aims to learn a subspace for clustering and thus inherits the advantages of subspace learning. In addition, several subspace segmentation approaches are also related to our research. Vidal et al. have proposed an approach, named Generalized Principal Component Analysis (GPCA) (Vidal, Ma, & Sastry, 2005), to segment an unknown number of subspaces of unknown and varying dimensions.
It is mainly used to solve computer vision problems. Ho et al. have presented the K-subspace algorithm (Ho, Yang, Lim, Lee, & Kriegman, 2003) to learn several subspaces for clustering a set of images of 3-D objects. Moreover, Tipping and Bishop have formulated PCA in a probabilistic framework and computed several subspaces with EM algorithms (Tipping & Bishop, 1999). Compared with our approach, these methods mainly focus on segmenting subspaces, whereas our research learns only one subspace. They have different goals, although they all focus on subspace learning.

4. Experiments

In the following experiments, we employ five different kinds of data sets, including the face image set Orl,1 the handwritten digit set Mnist,2 the object image set Coil20,3 the voice data set Isolet4 and the web text set Newsgroup.5
1 http://www.cl.cam.ac.uk/research/dtg/attarchive/facedata/base.html
2 http://yann.lecun.com/exdb/mnist/
3 http://www1.cs.columbia.edu/CAVE/research/softlib/coil-20.html
4 http://archive.ics.uci.edu/ml/datasets/ISOLET
5 http://people.csail.mit.edu/jrennie/20Newsgroups/
The characteristics of each data set are as follows. We select 100 samples per class from the Mnist data set, so there are 1000 images in total. For Isolet, we select the fifth subset. The Newsgroup data set is preprocessed as in Yu et al. (2006); it contains 8014-dimensional TFIDF features and four different categories, covering autos, motorcycles, baseball and hockey. Since the dimensionality is very high, we only select 200 samples per category. Some typical images from the first three data sets are shown in Fig. 2.

For illustration, we compare our algorithm (we employ one-shot SLPS and write 'SLPS' for brevity) with state-of-the-art subspace learning approaches: PCA, LdaKm and LPP. Since Kmeans is commonly used and Spectral Clustering also employs a similar graph to characterize the data structure, we also compare PsKm and PsNcut with these approaches (Kmeans and Ncut). Although SLPS is motivated by semi-supervised learning, we do not compare with semi-supervised methods, since they focus on classification while we aim to cluster data. The parameter α is determined by grid search in the following experiments; a minimal sketch of the PsKm pipeline used here is given below.
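This sketch runs PsKm for a single value of α, reusing the hypothetical knn_gaussian_graph and one_shot_slps helpers sketched in Section 3; it is illustrative only, not the authors' experimental code.

```python
import numpy as np
from sklearn.cluster import KMeans

def ps_km(X, d, c, alpha, k=7, seed=0):
    """PsKm sketch for one value of alpha: build the k-NN graph, run one-shot
    SLPS, then cluster the embedding with Kmeans."""
    _, L = knn_gaussian_graph(X, k=k)
    _, _, Z = one_shot_slps(X, L, alpha, d)
    km = KMeans(n_clusters=c, n_init=100, random_state=seed).fit(Z)
    return km.labels_, km.inertia_

# alpha itself is searched over the grid {0.05, 0.1, ..., 0.95}, as described
# in Section 3.3; PsNcut replaces the Kmeans step with Normalized Cut.
```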
Fig. 2. Sample figures from three image data sets. From top to bottom: Orl, Mnist, Coil20.
Fig. 3. Toy examples. The blue line shows the projection direction. (a) Original data and ideal projection direction; (b) original data and PCA projection direction; (c) original data and LdaKm projection direction; (d) original data and LPP projection direction; and (e) shrunk patterns and SLPS projection direction.
Fig. 4. Toy examples with different β. The blue line shows the projection direction derived by SLPS with different β. From (a) to (j), β is chosen from 0, 0.01, 0.06, 0.11, . . . , 0.41.
Fig. 5. Clustering results of different methods on the Orl data set with different embedding dimensionality, i.e., d; left: The Acc results; right: The NMI results.
Fig. 6. Clustering results of different methods on the Mnist data set with different embedding dimensionality, i.e., d; left: The Acc results; right: The NMI results.
Since parameter determination is still an open problem, other parameters, such as σ and k in Eq. (7), are determined in the same way as in related literature (Zelnik-Manor & Perona, 2004; van der Maaten, Postma, & van den Herik, 2009). More concretely, σ is computed by local scaling and k = 7 in all experiments.

There are mainly three groups of experiments. The first group contains experimental results on toy examples; we also vary the parameter β and compare SLPS with one-shot SLPS intuitively. In the second group, we compare our methods with other approaches quantitatively. Finally, we vary the parameter α and show its influence on the clustering performance.

The first experiment shows the effectiveness of our SLPS approach on the toy example in Fig. 3. It contains 80 examples sampled from two distinct Gaussian distributions that can be separated linearly. We employ PCA, LdaKm, LPP and SLPS to determine the projection directions (the blue dashed lines). As seen from this figure, SLPS clearly shrinks the patterns and projects the data in a way that keeps them separated, whereas the other methods fail to discover the right projection direction.
Additionally, since one-shot SLPS is obtained when β = 0, we choose this parameter from 0, 0.01, 0.06, 0.11, . . . , 0.41 and draw the corresponding projection directions in Fig. 4. The other settings are the same as in the previous experiment. From these results we can draw the following conclusions: (1) one-shot SLPS (β = 0) approximates the solution of SLPS well; (2) β balances the effects of pattern shrinking and subspace learning, and if it is too large, the method tends to derive a projection similar to that of PCA.

The second group of experiments compares our methods, i.e., PsKm and PsNcut, with other approaches quantitatively. For brevity, we name PCA + Kmeans as PcaKm, LPP + Kmeans as LppKm and PCA + Ncut as PcaNcut. We also combine the original SLPS with Ncut, denoted SlpsNcut, and report its results. We employ two popular metrics, Accuracy (Acc) (Papadimitriou & Steiglitz, 1982) and Normalized Mutual Information (NMI) (Strehl & Ghosh, 2003), to evaluate the clustering performance. For each dimensionality d, the clustering results that minimize the Kmeans loss over 100 runs are used for evaluation. The clustering results on the five data sets are shown in Figs. 5–9, respectively.
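A sketch of how these two metrics can be computed (Acc via the Hungarian matching of predicted clusters to true classes; this is a common recipe assuming SciPy and scikit-learn, not the authors' evaluation code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Acc: fraction of samples correctly assigned under the best one-to-one
    matching between predicted clusters and true classes (Hungarian method)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((clusters.size, classes.size))
    for i, kk in enumerate(clusters):
        for j, cc in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == kk) & (y_true == cc))
    rows, cols = linear_sum_assignment(cost)      # maximize matched counts
    return -cost[rows, cols].sum() / y_true.size

# NMI comes directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```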
Fig. 7. Clustering results of different methods on the Coil20 data set with different embedding dimensionality, i.e., d; left: The Acc results; right: The NMI results.
Fig. 8. Clustering results of different methods on the Isolet data set with different embedding dimensionality, i.e., d; left: The Acc results; right: The NMI results.
The main observations from these figures are as follows. (1) Since our method fully characterizes the low dimensional manifold structure and reflects the requirements of clustering via pattern shrinking, the results of PsKm, PsNcut and SlpsNcut in Figs. 5–9 are better than their corresponding clustering methods and the other subspace learning methods in most cases. (2) When the data have distinct manifold structures, for example the Coil20 data, the improvements are more significant, which indicates that our method characterizes the manifold structure effectively. (3) Comparing the results of PsKm and PsNcut with Kmeans and Ncut shows that pattern shrinking is an effective preprocessing technique; for example, on the Newsgroup data, the clustering accuracy is much higher after we reduce the dimensionality. (4) Since Kmeans and Ncut do not involve the dimensionality d, their results do not change with d, while the subspace learning methods perform differently for different d; nevertheless, dimensionality reduction is helpful in most cases.

In the third group, we show the influence of the parameter α. We fix d and apply PsKm to these data sets. For each α, varying from 0.1 to 0.9, each run is repeated 100 times; the other parameters are determined as in the previous experiments. The Acc and NMI results are shown in Figs. 10 and 11.

There are mainly three observations from Figs. 10 and 11. (1) The performance of PsKm depends on the selection of α; when α is suitably selected, PsKm achieves very high accuracy. (2) The sensitivity to α is also related to the characteristics of the data sets; for example, the gap between the largest and smallest values for Coil20 is much smaller than that for Newsgroup. (3) Different data sets call for different α to achieve the highest accuracy: for Coil20 we should choose a small α, while for some data, such as Newsgroup, a larger α is better.
Fig. 9. Clustering results of different methods on the Newsgroup data set with different embedding dimensionality, i.e., d; left: The Acc results; right: The NMI results.
Fig. 10. Clustering accuracy (Acc) of PsKm on different data sets with different α.
Fig. 11. Normalized Mutual Information (NMI) of PsKm on various data sets with different α.
5. Conclusion

In this paper, we proposed a new subspace learning model for clustering, which is mainly based on our pattern shrinking technique. The main idea is to project the original data into a subspace that is more convenient for clustering. Compared with state-of-the-art subspace clustering methods, it performs better. Moreover, it can also be regarded as an effective preprocessing technique; in other words, we can improve the performance of other clustering algorithms by simply adding the proposed pattern shrinking step. Future work mainly includes accelerating the proposed approach.

Acknowledgements

This work is supported by the 973 Program of China (No. 2013CB329503), NSFC China (Nos. 61005003, 60975038) and the Zhejiang Provincial Natural Science Foundation of China (No. Y1110661).

References

Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Record, 27(2), 94–105.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. SIGMOD Record, 28(2), 49–60.
Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1996). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 711–720.
Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399–2434.
van der Maaten, L. J. P., Postma, E. O., & van den Herik, H. J. (2009). Dimensionality reduction: A comparative review. Tech. rep., Tilburg University.
Ding, C., & Li, T. (2007). Adaptive dimension reduction using discriminant analysis and k-means clustering. In ICML '07: Proceedings of the 24th international conference on machine learning (pp. 521–528). New York, NY, USA: ACM.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). Berlin: Springer.
Gondek, D., & Hofmann, T. (2007). Non-redundant data clustering. Knowledge and Information Systems, 12(1), 1–24.
Guha, S., Rastogi, R., & Shim, K. (2000). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th international conference on data engineering.
Gu, Q., & Zhou, J. (2009). Subspace maximum margin clustering. In Proceedings of the 18th ACM conference on information and knowledge management, CIKM '09 (pp. 1337–1346). New York, NY, USA: ACM.
He, X., & Niyogi, P. (2003). Locality preserving projections. In Advances in neural information processing systems (Vol. 16). MIT Press.
Ho, J., Yang, M.-H., Lim, J., Lee, K.-C., & Kriegman, D. (2003). Clustering appearances of objects under varying illumination conditions. In CVPR (pp. 11–18).
Hofmann, T., & Buhmann, J. M. (1998). Active data clustering. In Advances in neural information processing systems (Vol. 10, pp. 528–534).
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. Wiley-Interscience.
Li, T., Ma, S., & Ogihara, M. (2004). Document clustering via adaptive subspace iteration. In Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR '04 (pp. 218–225). New York, NY, USA: ACM.
Papadimitriou, C. H., & Steiglitz, K. (1982). Combinatorial optimization: Algorithms and complexity. Upper Saddle River, NJ, USA: Prentice-Hall.
Robles-Kelly, A., & Hancock, E. R. (2004). Spanning tree recovery via random walks in a Riemannian manifold. In CIARP (pp. 303–311).
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Shi, Y., Song, Y., & Zhang, A. (2005). A shrinking-based clustering approach for multidimensional data. IEEE Transactions on Knowledge and Data Engineering, 17(10), 1389–1403.
Strehl, A., & Ghosh, J. (2003). Cluster ensembles – A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617.
Tipping, M. E., & Bishop, C. M. (1999). Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2), 443–482.
Vidal, R., Ma, Y., & Sastry, S. (2005). Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12), 1945–1959.
Vogt, F., Mizaikoff, B., & Tacke, M. (2001). Numerical methods for accelerating the PCA of large data sets applied to hyperspectral imaging. In Proceedings of SPIE, chemometrics for environmental sensing (pp. 215–226).
Xu, R., & Wunsch, D. I. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645–678.
Ye, J., Zhao, Z., & Wu, M. (2007). Discriminative k-means for clustering. In NIPS.
Yu, K., Bi, J., & Tresp, V. (2006). Active learning via transductive experimental design. In ICML '06: Proceedings of the 23rd international conference on machine learning (pp. 1081–1088). New York, NY, USA: ACM.
Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. In Advances in neural information processing systems (Vol. 17, pp. 1601–1608). MIT Press.
Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Schölkopf, B. (2003). Learning with local and global consistency. In Advances in neural information processing systems. Cambridge, MA: MIT Press.