Graph Regularized Low-Rank Representation for Submodule Clustering
Journal Pre-proof
Graph Regularized Low-Rank Representation for Submodule Clustering Tong Wu PII: DOI: Reference:
S0031-3203(19)30446-7 https://doi.org/10.1016/j.patcog.2019.107145 PR 107145
To appear in:
Pattern Recognition
Received date: Revised date: Accepted date:
5 April 2019 4 November 2019 27 November 2019
Please cite this article as: Tong Wu, Graph Regularized Low-Rank Representation for Submodule Clustering, Pattern Recognition (2019), doi: https://doi.org/10.1016/j.patcog.2019.107145
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Ltd.
Highlights • We present a new unified framework for submodule clustering. • The approach explicitly considers the manifold structure of data. • A nonlinear extension is proposed for manifold clustering using kernel methods. • Experimental results demonstrate the effectiveness of the proposed methods.
1
Graph Regularized Low-Rank Representation for Submodule Clustering Tong Wua,∗ a Department
of Electrical and Computer Engineering, Rutgers University, NJ, USA
Abstract In this paper a new submodule clustering method for imaging (2-D) data is proposed. Unlike most existing clustering methods that first convert such data into vectors as preprocessing, the proposed method arranges the data samples as lateral slices of a third-order tensor. Our algorithm is based on the union-offree-submodules model and the samples are represented using t-product in the third-order tensor space. First, we impose a low-rank constraint on the representation tensor to capture the principle information of data. By incorporating manifold regularization into the tensor factorization, the proposed method explicitly exploits the local manifold structure of data. Meanwhile, a segmentation dependent term is employed to integrate the two pipeline steps of affinity learning and spectral clustering into a unified optimization framework. The proposed method can be efficiently solved based on the alternating direction method of multipliers and spectral clustering. Finally, a nonlinear extension is proposed to handle data drawn from a mixture of nonlinear manifolds. Extensive experimental results on five real-world image datasets confirm the effectiveness of the proposed methods. Keywords: Clustering, kernel methods, manifold regularization, submodule clustering, tensor nuclear norm, union of free submodules.
∗ Corresponding
author Email address:
[email protected] (Tong Wu)
Preprint submitted to Journal of LATEX Templates
November 28, 2019
1. Introduction The need for representing and processing high-dimensional data has been increasing recently in many practical applications such as machine learning and computer vision. Fortunately, the intrinsic dimension of high-dimensional data 5
is much smaller than the dimension of the ambient space. One prevalent lowdimensional structure is that data lie near a union of low-dimensional subspaces. Recovering such low-dimensional subspaces requires clustering data samples into different groups such that each group corresponds to one subspace, which is referred to as subspace clustering. This problem has numerous applications, such
10
as image clustering [1], motion segmentation [2], and temporal video segmentation [3]. Over the past few years, various subspace clustering approaches have been developed, including iterative methods [4, 5], algebraic methods [3], statistical methods [6–8], and spectral clustering based methods [1, 2, 9–19]. Among them,
15
spectral clustering based methods have been popular due to their great success in many applications. These methods involve learning an affinity matrix (or graph) from the data and then applying spectral clustering on the graph to infer the clustering of data. Ideally, the affinity matrix should be block diagonal, i.e., the inter-cluster affinities are all zeros. A significant fraction of the
20
literature is devoted to learn a “good” affinity matrix. Concretely, Sparse Subspace Clustering (SSC) [2] seeks a sparse representation of the data. Low-Rank Representation (LRR) [1] imposes a low-rank constraint on the representation coefficient matrix to capture the global structure of data. In Least Squares Regression (LSR) [9], the Frobenius norm is used to regularize the coefficient
25
matrix, which can yield a block diagonal solution when the underlying subspaces are independent. To deal with imaging data, typical subspace clustering approaches tend to vectorize the data samples such that the conventional algorithms designed for vectorial data can still be applied. Although this vectorization approach works
30
well in many cases, it breaks the multidimensional structure of the images and
3
reduces the reliability of subsequent processing. There have been several recent works that have tried to exploit spatial aspects of imaging data by using multilinear algebra tools [20–22], and these approaches have been shown to outperform the vectorization-based approaches in myriad tasks [23–27]. In particular, 35
motivated by the t-product [21, 22], which is a matrix-like multiplication for third-order tensors and provides a better way of exploiting the intrinsic structure of a third-order tensor [25], one can assume that the images in each cluster approximately lie near a free submodule. The data samples can be represented in a t-linear combination manner by imposing some constraints on the repre-
40
sentation tensor, from which an affinity matrix can be built and used for final clustering. In [25], a group sparsity constraint is imposed on the representation tensor, while a low-rank tensor model that incorporates a weighted sparsity constraint on the representation tensor is proposed in [26]. In the literature, encouraging results have been reported for image clustering
45
based on the union of free submodules (UoFS) model [25, 26]. But there remains a lot of room for improvement. First, the intrinsic manifold structure of data is not explicitly considered in none of these works. Meanwhile, existing works on manifold learning have shown that considering both the global and local geometrical structure of data can lead to improved performance in the context
50
of a number of applications [11, 16, 28–30]. We conjecture that the clustering performance could be improved if we could explicitly account for such intrinsic manifold structure. Second, existing submodule clustering works use spectral clustering as a post-processing step on the graph generated from a coefficient tensor, but the relationship between the coefficient tensor and the segmentation
55
of data is seldom considered, which can lead to sub-optimal results. Inspired by the recent work [18], we hypothesize that by making the clustering result dependent on the generation of the optimal representation tensor, we will be able to obtain better clustering results. Another limitation of the UoFS model is the individual t-linearity of its constituent submodules, which limits its usefulness
60
for data drawn from a mixture of nonlinear manifolds [10, 17, 31, 32]. On the other hand, while kernel subspace clustering [10, 17] can handle manifold 4
data, the vectorization preprocessing in such methods can potentially lead to information loss and cause performance degradation. To overcome the limitations of existing methods, we propose a new submod65
ule clustering algorithm, termed clustering-aware Laplacian regularized low-rank submodule clustering (CLLRSmC), for improved image clustering based on the UoFS model. Our work is most closely related to that in [26]. Both these works are based on the UoFS model and use the tensor nuclear norm [24, 27] to impose a tensor low-rank condition on the representation tensor. However, [26] incorpo-
70
rates a structure constraint on each frontal slice of the representation tensor. In contrast, we introduce a manifold regularization term in the objective function of CLLRSmC to exploit the manifold structure of data. Meanwhile, a segmentation dependent term coupling the representation tensor and the segmentation matrix is incorporated into the optimization program, which ensures consistency
75
between the representation tensor and the submodule segmentation result. In order to deal with images drawn from a union of nonlinear manifolds, we propose a kernel extension of CLLRSmC. Unlike previous works [10, 17] that map the vectorized data from the input space into some higher-dimensional feature space using a kernel-induced mapping [33], our method maps each column of
80
an image separately to the feature space such that the mapping of an image in the input space is still an image in which each column resides in the feature space. By arranging these mapped images as lateral slices in a third-order tensor, we model them as lying near a mixture of free submodules. Our experimental results confirm the superiority of the proposed methods in comparison
85
to a number of state-of-the-art alternatives on several benchmark datasets. In summary, our main contributions in this paper is threefold: 1) We propose a new submodule clustering method, termed CLLRSmC, which explicitly considers the intrinsic manifold structure of data and integrates the two separate stages, i.e., learning a low-rank representation tensor
90
and performing spectral clustering, into a unified optimization framework. 2) An efficient optimization procedure is presented to solve the CLLRSmC optimization problem using a combination of an alternating direction method 5
of multipliers with spectral clustering. 3) We provide a nonlinear extension of CLLRSmC, termed KCLLRSmC, 95
which combines manifold regularization and kernel methods for manifold clustering. We derive an iterative algorithm to solve the problem of KCLLRSmC using the kernel trick [34], which avoids explicit mapping of data to the feature space. The rest of the paper is organized as follows. Section 2 introduces the no-
100
tations and basic multilinear algebra facts that form the basis of this paper. Section 3 briefly reviews the related works. In Section 4, we mathematically formulate the CLLRSmC problem and give the details of our algorithm. In Section 5, we discuss our nonlinear extension of CLLRSmC. Experimental results are presented in Section 6, which is followed by concluding remarks in Section 7.
105
2. Notations and Preliminaries In this section, we introduce the notations and give the basic definitions used throughout the paper. We use calligraphic uppercase letters for tensors, e.g., A, bold uppercase letters for matrices, e.g., A, bold lowercase letters for vectors, e.g., a, and non-bold letters for scalars, e.g., a. The i-th element of
110
a vector a is denoted by ai and the (i, j)-th element of a matrix A is denoted by ai,j . The i-th row and j-th column of a matrix A are denoted by ai and aj , respectively. The zero matrix and the identity matrix are denoted by 0 and I of appropriate dimensions, respectively. The inner product between two matrices A, B ∈ Cn1 ×n2 is hA, Bi = tr(AT B), where AT denotes the conjugate
115
transpose of A and tr(·) denotes the trace operation. For a third-order tensor A ∈ Cn1 ×n2 ×n3 , we denote its (i, j, k)-th entry as ai,j,k and use Matlab notation A(i, :, :), A(:, i, :) and A(:, :, i) to denote its i-th horizontal, lateral and frontal slices, respectively. More often, the frontal slice A(:, :, i) is concisely denoted
as A(i) . The (i, j)-th mode-1, mode-2 and mode-3 fibers are represented by 120
A(:, i, j), A(i, :, j) and A(i, j, :), respectively. The inner product of two tensors Pn3 A, B ∈ Cn1 ×n2 ×n3 is defined as hA, Bi = k=1 hA(k) , B(k) i. We denote dae as 6
the nearest integer greater than or equal to a scalar a.
125
Some norms of vector, matrix and tensor are used. The Frobenius and qP 2 infinity norms of a tensor A are defined as kAkF = i,j,k |ai,j,k | and kAk∞ =
maxi,j,k |ai,j,k |, respectively. These norms reduce to matrix norms if A is a pP 2 matrix. The `2 norm of a vector a is represented by kak2 = i |ai | . The `1 P norm of a matrix A is denoted by kAk1 = i,j |ai,j |.
For A ∈ Rn1 ×n2 ×n3 , we use Ab = fft(A, [ ], 3) to denote the Fourier transform
of A along the third dimension. Similarly, one can compute A from Ab using the
b i.e., A = ifft(A, b [ ], 3). The block inverse Fourier transform along mode-3 of A, circulant matrix of a tensor A has size n1 n3 × n2 n3 and is A(1) A(n3 ) . . . A(2) A(1) . . . A(3) A(2) bcirc(A) = .. .. .. .. . . . . A(n3 )
A(n3 −1)
...
A(1)
defined as .
(1)
We also define the block vectorizing and its inverse operation as bvec(A) = [A(1) ; A(2) ; . . . ; A(n3 ) ] ∈ Rn1 n3 ×n2 and bvfold(bvec(A)) = A, respectively. Definition 1 (t-product [22]). Let A ∈ Rn1 ×n2 ×n3 and B ∈ Rn2 ×n4 ×n3 , then the t-product A ∗ B is defined to be a tensor of size n1 × n4 × n3 , i.e., C = A ∗ B = bvfold(bcirc(A)bvec(B)).
(2)
It can also be defined based on the circular convolution, i.e., C(i, p, :) = 130
n2 X j=1
A(i, j, :) ◦ B(j, p, :),
(3)
where ◦ denotes circular convolution between two fibers of the same size. Note that the t-product in the original domain is equivalent to matrix multiplib (k) = A b (k) B b (k) , k = 1, . . . , n3 . cation in the Fourier domain; that is, C
Definition 2 (Tensor transpose). The conjugate transpose of a tensor A ∈
Cn1 ×n2 ×n3 is the n2 × n1 × n3 tensor AT obtained by conjugate transposing 135
each frontal slice of A and then reversing the order of transposed frontal slices 2 through n3 . 7
Definition 3 (Identity tensor). The identity tensor I ∈ Rn×n×n3 is a tensor whose first frontal slice is the n × n identity matrix and all other frontal slices are zero. 140
Definition 4 (Orthogonal tensor). A tensor Q ∈ Rn×n×n3 is orthogonal if it satisfies Q ∗ QT = QT ∗ Q = I.
Definition 5 (f-diagonal tensor). A tensor is called f-diagonal if each frontal slice of the tensor is a diagonal matrix. Theorem 1 (tensor singular value decomposition (t-SVD) [21, 27]). Let A ∈
Rn1 ×n2 ×n3 , the t-SVD is given by
A = U ∗ S ∗ VT ,
(4)
where U and V are orthogonal tensors of size n1 × n1 × n3 and n2 × n2 × n3 , 145
respectively, and S is an f-diagonal tensor of size n1 × n2 × n3 . One can obtain t-SVD efficiently via matrix SVDs in the Fourier domain. Let b (k) , S b (k) , V b (k) ] = SVD(A b (k) ) for k = 1, 2, . . . , n3 . We Ab = fft(A, [ ], 3) and [U can compute t-SVD by
b [ ], 3), S = ifft(S, b [ ], 3), V = ifft(V, b [ ], 3). U = ifft(U,
(5)
In fact, it has been proved that one only needs to run d n32+1 e matrix SVDs because of the conjugate symmetry in the Fourier domain [27]. Definition 6 (Tensor multi rank and tubal rank [22, 24]). The tensor multi rank of A ∈ Rn1 ×n2 ×n3 is a vector r ∈ Rn3 with its k-th entry equal to the rank 150
b The tensor tubal rank, denoted by rankt (A), is of the k-th frontal slice of A.
defined as the number of nonzero singular tube fibers of S, where S is from the
t-SVD of A = U ∗ S ∗ V T , i.e., rankt (A) = #{i, S(i, i, :) 6= 0} = maxk rk .
Definition 7 (Tensor nuclear norm [24, 27]). Let A = U ∗ S ∗ V T be the
155
t-SVD of A ∈ Rn1 ×n2 ×n3 , then the tensor nuclear norm of A is defined as the Pr sum of the tensor singular values, i.e., kAk~ = i=1 si,i,1 , where r = rankt (A). 8
twist
squeeze
Figure 1: The twist and squeeze operations for an image.
2.1. Linear Algebra with t-product Given an image with a size of n1 × n3 , one can twist it into a “page” and form a third-order tensor of size n1 × 1 × n3 (also called “oriented matrices”),
as shown in Fig. 1. We denote Kn3 as the set of all tube fibers with a size of
160
1 × 1 × n3 and Knn13 as the set of oriented matrices of size n1 × 1 × n3 . In fact, a tensor with a size of n1 × 1 × n3 can be regarded as a vector of length n1 , where each element is a 1 × 1 × n3 tube fiber. Note that the set Kn3 forms
a ring with unity under the multiplication given by t-product and the regular addition [21]. Thus, a module over a ring generalizes the concept of a vector 165
space over a field, where the corresponding “scalars” are the elements of the ring [25]. The analogue of a subspace is a free submodule. Specifically, an s→ − dimensional (s < n1 ) free submodule Snn31 has a generating set { D i }si=1 such → − → − that any element X ∈ Snn31 can be written as a t-linear combination of the D i s: Ps → → − − X = i=1 D i ∗ ~ci for ~ci ∈ Kn3 [26].
170
3. Related Work Consider a collection of N images X = {Xj ∈ Rn1 ×n3 }N j=1 that belong to L different categories. Traditional subspace clustering approaches [1–3, 5, 9, 11– 16, 18, 19] usually use an “unfolding” process to vectorize each sample Xj into a vector xj ∈ RM , M = n1 n3 , and represent all the vectorized samples in a
matrix X = [x1 , x2 , . . . , xN ] ∈ RM ×N . It is then assumed that the xj ’s lie near a
mixture of L low-dimensional subspaces in RM . The task of subspace clustering is to segment the data samples in X according to their respective subspaces. The
9
Sparse Subspace Clustering (SSC) [2] and Low-Rank Representation (LRR) [1] algorithms exploit the fact that data in a union of subspaces are self-expressive. Mathematically, these algorithms aim to find a matrix Z by solving the following optimization problem: min kZkb + λkEk2F Z,E
s.t.
X = XZ + E,
(6)
where k · kb is the `1 norm in the case of SSC [2] to promote sparseness in Z and the nuclear norm in the case of LRR [1] to guarantee a low-rank solution. Here, λ is a regularization parameter that balances the two terms. In addition, an additional constraint of diag(Z) = 0 is added to (6) in the case of SSC to 175
rule out the trivial solution of Z being an identity matrix. Once the matrix Z is available, spectral clustering [35] can be performed on the affinity matrix defined as W = |Z| + |Z|T to obtain the segmentation of the data X.
=
𝒳𝑗
∗
𝒳
𝒵𝑗
Figure 2: An image belonging to a union of free submodules represented as a t-linear combination of other images belonging to the same free submodule. The samples with the same color in X are from the same free submodule. The green tube fibers in the coefficient tensor correspond to non-zero fibers and the white ones are zero fibers.
While the above vectorization approach has been successful in many applications, it does not consider the spatial structure of the images. In contrast, t-product provides a novel algebraic approach that generalizes the matrix multiplication for third-order tensors [22]. Owing to this operator, one can arrange → − each Xj laterally as an oriented matrix X j ∈ Knn13 and assume the images beL long to a union of L free submodules {`Snn31 }L `=1 of dimensions {s` < n1 }`=1 . Let
10
X denote the data tensor of size n1 × N × n3 obtained by twisting each image into an n1 × 1 × n3 tensor and stacking them laterally. Different from subspace clustering, an image belonging to a union of free submodules can be expressed as a t-linear combination of other images, as illustrated in Fig. 2. The main advantage of this representation is that it allows one image to be represented as shifted copies of other images, thus making any resulting method robust to spatial shifts [25, 26]. In [25], Sparse Submodule Clustering (SSmC) algorithm is proposed by solving the following optimization problem: min kZkF 1 + λ1 kZkF F 1 + λ2 kX − X ∗ Zk2F Z
s.t. zi,i,k = 0, i = 1, 2, . . . , N, k = 1, 2, . . . , n3 ,
(7)
P where Z ∈ RN ×N ×n3 denotes the representation tensor, kAkF 1 = i,j kA(i, j, : P )kF and kAkF F 1 = i kA(i, :, :)kF . Here, λ1 and λ2 are regularization parameters. Motivated by the intuition that images belonging to different free sub-
modules should have lower correlations, the authors in [26] proposed StructureConstrained Low-Rank Submodule Clustering (SCLRSmC), whose learning can be formulated as follows: min kZk~ + λ1 Z
n3 X
k=1
kB Z(k) k1 + λ2 kX − X ∗ Zk2F ,
(8)
where B ∈ RN ×N is a predefined weight matrix associated with the data and denotes the Hadamard product. In general, the matrix B imposes a weighted 180
sparsity constraint on the each frontal slice of Z by penalizing affinities between data samples from different clusters, while rewarding affinities between data samples from the same cluster. Using the resulting tensor Z, one can build an affinity matrix W whose (i, j)-th entry is wi,j = kZ(i, j, :)kF + kZ(j, i, :)kF , which is then used for spectral clustering.
185
4. Clustering-Aware Laplacian Regularized Low-Rank Submodule Clustering In this section, we introduce the proposed CLLRSmC algorithm, in which we consider the local manifold structure of the given data, and a consistency 11
term is incorporated to unify the pipeline of submodule clustering. 190
4.1. Problem Formulation Given a collection of N images of size n1 × n3 , we store them as lateral
slices of a third-order tensor X ∈ Rn1 ×N ×n3 . Our approach is based on the self-expressiveness property of the union of free submodules. First, in order to explore the spatial information within and across frontal slices of the representation tensor, we impose a low multi rank constraint on the representation tensor by minimizing its tensor nuclear norm kZk~ . This is because tensor nuclear norm is the tightest convex relaxation of `1 norm of the tensor multi rank [24]. Next, inspired by recent works on manifold learning [16, 29, 30], we propose to capture the local manifold structure of data by introducing manifold regularization into the learning of Z. To this end, we assume that if two images X (:, i, :) and X (:, j, :) are close in the intrinsic geometry of the data distribution, then the representations of these two images, Z(:, i, :) and Z(:, j, :), are also close to each other with respect to the tensor dictionary X , which leads to minimizing the following quantity: T = =
=
N
N
N
N
1 XX kZ(:, i, :) − Z(:, j, :)k2F γi,j 2 i=1 j=1
1 XX kbvec(Z(:, i, :)) − bvec(Z(:, j, :))k22 γi,j 2 i=1 j=1
N X i=1
bvec(Z(:, i, :))T bvec(Z(:, i, :))δi,i −
N X N X
bvec(Z(:, i, :))T bvec(Z(:, j, :))γi,j
i=1 j=1
T
= tr(bvec(Z)∆bvec(Z) ) − tr(bvec(Z)Γbvec(Z)T ) = tr(bvec(Z)Ψbvec(Z)T ),
(9)
where Γ ∈ RN ×N is a similarity matrix associated with an adjacency graph G which consists of N vertices with the i-th vertex corresponding to X (:, i, :), ∆ P is a diagonal matrix with its diagonal elements defined as δi,i = j γi,j , and Ψ = ∆ − Γ. Let Nα (X (:, i, :)) denote the set of α nearest neighbors of a node X (:, i, :) with respect to some distance metric ξ. Then the symmetric weight 12
matrix Γ associated with the graph G is defined as 1, if X (:, i, :) ∈ Nα (X (:, j, :)) or X (:, j, :) ∈ Nα (X (:, i, :)), γi,j = 0, otherwise.
In this work, we measure the distances between images as follows: let X be the
tensor obtained by normalizing each lateral slice of X such that kX (:, i, :)kF = 1, i = 1, 2, . . . , N ; then, ξ(X (:, i, :), X (:, j, :)) = kX (:, i, :) − X (:, j, :)k2F . We incorporate manifold regularization into the low-rank tensor model and propose the following optimization problem: min kZk~ + λ1 tr(bvec(Z)Ψbvec(Z)T ) + λ2 kX − X ∗ Zk2F , Z
(10)
where λ1 and λ2 are regularization parameters. After obtaining an optimal representation tensor Z by solving (10), we can build an affinity matrix W whose (i, j)-th entry is wi,j = kZ(j, i, :)k2F )
1 2 2 (kZ(i, j, :)kF
+
and apply spectral clustering [35] on W to obtain final clustering,
which solves the following problem: min tr(FT (D − W)F) F
s.t.
F ∈ P,
(11)
where F ∈ RN ×L is a binary matrix indicating the cluster membership of the
195
data points, i.e., ∀i, fi,` = 1 if X (:, i, :), i.e., Xi , lies in submodule `Snn31 and fi,` = P 0 otherwise. Here, D ∈ RN ×N is a diagonal matrix with di,i = j wi,j . The constraint P can be written as P = {F ∈ {0, 1}N ×L : F1 = 1, rank(F) = L},
which denotes the space of all valid segmentation matrices with L clusters. Since each image lies in only one submodule, we must have F1 = 1, where 1 is a vector of all ones of appropriate dimension. Finally, the constraint rank(F) = L is used to guarantee that the number of submodules is equal to L. In practice, problem 200
(11) can be solved by relaxing the constraint F ∈ P to FT F = I. The solution of F for the relaxed problem consists of the eigenvectors of the Laplacian matrix M = D − W associated with its smallest L eigenvalues. Afterwards, k-means clustering is applied to the rows of F to obtain the final binary segmentation matrix. 13
However, sequentially solving (10) and (11) independently may lead to suboptimal clustering results due to the fact that these two pipeline steps highly depend on each other. Notice that both the coefficient tensor Z and the segmentation matrix F carry the segmentation information of the data. It can be observed that if Xi and Xj lie in different submodules, i.e., f i 6= f j , we would like to have Z(i, j, :) = 0 for better clustering. Therefore, we can quantify the disagreement between Z and F as follows: X i,j
n
3 X 1 kZ(i, j, :)k2F ( kf i − f j k22 ) = kΘ Z(k) Z(k) k1 2
k=1
1X wi,j kf i − f j k22 = tr(FT (D − W)F), = 2 i,j
where Θ ∈ RN ×N is a matrix with θi,j =
1 i 2 kf
(12)
− f j k22 . The first term in (12)
is effectively a penalty on Z when points Xi and Xj are in different submodules according to F. By jointly optimizing (10) and (11), we finally pose the problem of clustering-aware Laplacian regularized low-rank submodule clustering (CLLRSmC) in terms of the following optimization problem: min kZk~ + λ1 tr(bvec(Z)Ψbvec(Z)T ) + λ2 Z,F
s.t. 205
n3 X
k=1
kΘ Z(k) Z(k) k1 + λ3 kX − X ∗ Zk2F
F ∈ P,
(13)
where λ1 , λ2 and λ3 are penalty parameters. Once the representation tensor Z is computed in one iteration of CLLRSmC, the segmentation matrix F can be obtained with spectral clustering, which is then used to refine the tensor Z in the next iteration. As suggested in [18], we use a real-valued segmentation matrix F instead of using a binary F to refine the coefficient tensor Z, since the
210
real-valued F carries more detailed information in the clustering results. 4.2. Optimization In this subsection, we describe our approach to solve the optimization problem (13). Our solution consists of the following two steps: • Given F, find the representation tensor Z. 14
215
• With Z fixed, compute the clustering indicator matrix F by spectral clustering. 4.2.1. Solution of the Representation Given a real-valued matrix F, we first compute the matrix Θ with θi,j = 1 i 2 kf
− f j k22 . Then we solve for Z by solving the following subproblem: min kZk~ + λ1 tr(bvec(Z)Ψbvec(Z)T ) Z
+ λ2
n3 X
k=1
kΘ Z(k) Z(k) k1 + λ3 kX − X ∗ Zk2F .
(14)
To solve the optimization problem (14), we introduce three auxiliary variables P, Q and C to make the objective function of (14) separable and reformulate (14) as min kZk~ + λ1 tr(bvec(P)Ψbvec(P)T ) + λ2
Z,Q,C
s.t.
n3 X
k=1
kΘ Q(k) Q(k) k1 + λ3 kX − X ∗ Ck2F
Z = C, P = C, Q = C.
(15)
This problem can be solved using alternating direction method of multipliers (ADMM) [36]. Specifically, the augmented Lagrangian of (15) is given by L(Z, P, Q, C, G1 , G2 , G3 ) = kZk~ + λ1 tr(bvec(P)Ψbvec(P)T ) + λ2
n3 X
k=1
kΘ Q(k) Q(k) k1 + λ3 kX − X ∗ Ck2F
+ hG1 , Z − Ci + hG2 , P − Ci + hG3 , Q − Ci +
µ (kZ − Ck2F + kP − Ck2F + kQ − Ck2F ), 2 (16)
where the tensors G1 , G2 and G3 comprise Lagrangian multipliers and µ > 0 is a penalty parameter. We update each of Z, P, Q and C alternately while keeping 220
the other variables fixed. Updating Z: We update Z by solving the following problem: (t)
Z (t+1) = arg min kZk~ + Z
µ(t) G kZ − (C (t) − 1(t) )k2F . 2 µ
15
(17)
This problem can be solved using [27, Theorem 4.2]. Updating P: Fixing other tensors in (16), the subproblem of updating P can be expressed as (t)
P (t+1) = arg min λ1 tr(bvec(P)Ψbvec(P)T ) + P
µ(t) G kP − (C (t) − 2(t) )k2F . (18) 2 µ
Note that (18) can be rewritten in matrix form as (t)
P(t+1) = arg min λ1 tr(PΨPT ) + P
µ(t) G kP − bvec(C (t) − 2(t) )k2F , 2 µ
(19)
where P = bvec(P) and P(t+1) = bvec(P (t+1) ). By setting the derivative of (19) with respect to P to zero, we have (t)
P(t+1) = µ(t) bvec(C (t) −
G2 )(2λ1 Ψ + µ(t) I)−1 . µ(t)
(20)
Hence, P (t+1) can be obtained by setting P (t+1) = bvfold(P(t+1) ). Updating Q: While other variables are fixed, we update Q as follows: Q(t+1) = arg min λ2 Q
where V (t+1) = C (t) −
n3 X
k=1
kΘ Q(k) Q(k) k1 +
(t)
G3 . µ(t)
µ(t) kQ − V (t+1) k2F , 2
(21)
We can optimize Q slice-by-slice, that is, the sub-
problem is equivalent to solving Q(k)
(t+1)
= arg min λ2 kΘ Q Qk1 + Q
(t+1) µ(t) kQ − V(k) k2F . 2
(22)
By setting the derivative of the objective function in (22) with respect to each (k)(t+1)
qi,j to zero, we have qi,j
(k)(t+1)
=
µ(t) vi,j . 2λ2 θi,j +µ(t)
Updating C: We update C by solving the following problem: µ(t) (t+1) 2 C (t+1) = arg min λ3 kX − X ∗ Ck2F + kF kC − B1 2 C (t+1) 2 (t+1) 2 + kC − B2 kF + kC − B3 kF , (t+1)
where B1
G
(t)
(t+1)
= Z (t+1) + µ1(t) , B2
G
(t)
(t+1)
= P (t+1) + µ2(t) and B3
(23) G
(t)
= Q(t+1) + µ3(t) .
To optimize C efficiently, we have the following subproblem of C in the Fourier domain: (t) (t+1) 2 (t+1) 2 (t+1) 2 b 2F + µ kCb − Bb1 kF + kCb − Bb2 kF + kCb − Bb3 kF , Cb(t+1) = arg min λ3 kXb − Xb ⊗ Ck 2 Cb 16
where ⊗ denotes slicewise multiplication. Therefore, we can again optimize Cb slice-by-slice from the frontal side, that is,
(t) (t+1) b (k)(t+1) = arg min λ3 kX b (k) − B b (k) b (k) − X b (k) C b (k) k2 + µ (kC C k2F F 1 2 (k) b C (t+1)
b (k) − B b (k) + kC 2
It can then be shown that
225
(t+1)
b (k) − B b (k) k2F + kC 3
k2F ).
b (k)(t+1) = (2λ3 X b (k)T X b (k) + 3µ(t) I)−1 C (t+1) (t+1) (t+1) b (k)T X b (k) + µ(t) (B b (k) b (k) b (k) 2λ3 X +B +B ) . 1 2 3
(24)
(25)
After updating Cb(t+1) , it is trivial to compute C (t+1) by using inverse FFT. The ADMM algorithm for solving (14) is summarized in Algorithm 1. 4.2.2. Spectral Clustering Once we have Z available, problem (13) reduces to the following problem: min F
Recall that
P
i,j
X i,j
1 kZ(i, j, :)k2F ( kf i − f j k22 ) 2
s.t. F ∈ P.
(26)
kZ(i, j, :)k2F ( 21 kf i − f j k22 ) = tr(FT MF), where M = D − W
is the graph Laplacian of the affinity matrix W whose (i, j)-th entry is wi,j = 1 2 2 (kZ(i, j, :)kF
+ kZ(j, i, :)k2F ), and D is the degree matrix with its diagonal P entries defined as ∀i, di,i = j wi,j . To make the problem tractable, we relax the constraint F ∈ P to FT DF = I and solve min tr(FT MF) F
s.t.
FT DF = I.
(27)
This problem has a closed-form solution that involves eigendecomposition. More e = D 12 F, we solve for F e instead of F and problem (27) specifically, by letting F turns out to be
e T D− 21 MD− 12 F) e min tr(F e F
s.t.
eT F e = I. F
(28)
e are given by the eigenvectors of the matrix D− 21 MD− 12 The columns of F
230
e to associated with its smallest L eigenvalues. We then normalize each row of F unit `2 norm and set the (i, j)-th entry of Θ to be θi,j = 21 k˜f i − ˜f j k22 ∈ [0, 2]. 17
Algorithm 1: Solving problem (14) by ADMM Input: Data X , matrix Ψ, and parameters λ1 , λ2 and λ3 . Initialize: Tensors (0)
Z (0) = P (0) = Q(0) = C (0) = G1
(0)
= G2
(0)
= G3
= 0 ∈ RN ×N ×n3 ,
µ(0) = 0.1, ρ = 1.9, µmax = 1010 , = 10−5 , and t = 0. 1:
while not converged do
2:
Update Z, P, Q and C according to (17), (18), (21) and (23), respectively.
3:
G1
4:
G2
5:
G3
(t+1)
(t+1)
(t+1)
(t)
= G1 + µ(t) (Z (t+1) − C (t+1) ). (t)
= G2 + µ(t) (P (t+1) − C (t+1) ). (t)
= G3 + µ(t) (Q(t+1) − C (t+1) ).
6:
µ(t+1) = min(µmax , ρµ(t) ).
7:
Check convergence conditions and break if (t+1) (t+1) (t+1) (t+1) (t+1) (t+1) kZ − C k , kP − C k , kQ − C k ∞ ∞ ∞ < . max kZ (t+1) − Z (t) k∞ , kP (t+1) − P (t) k∞ , kQ(t+1) − Q(t) k∞ kC (t+1) − C (t) k ∞
8:
t = t + 1.
9:
end while
Output: Representation tensor Z ∗ = Z (t+1) . 4.2.3. Algorithm Summary We outline our algorithm for solving problem (13) in Algorithm 2. The algorithm alternatives between solving for the self-representation coefficient tensor Z by fixing the segmentation matrix F using Algorithm 1, and solving for F by 235
fixing Z using spectral clustering. Stopping criterion: In practice, Algorithm 2 can be stopped by setting a maximum iteration number Tmax , or when the relative changes of the matrix Θ in two consecutive iterations is smaller than a predefined small parameter ζ, i.e.,
240
kΘ(T ) −Θ(T +1) k1 kΘ(T ) k1
< ζ, where T = 1, 2, . . . is the number of outer iterations.
In this paper, we set ζ = 0.01.
18
Algorithm 2: CLLRSmC Input: Data X , matrix Ψ, and parameters L, λ1 , λ2 and λ3 . Initialize: Θ(0) = 0 and T = 0. 1:
while not converged do
2:
Given F(T ) , solve problem (14) via Algorithm 1 to obtain Z (T +1) .
3:
Given Z (T +1) , solve problem (26) via spectral clustering to obtain F(T +1) .
4:
T = T + 1.
5:
end while
Output: Segmentation matrix F = F(T +1) .
Computation complexity: First, notice that the computational bottleneck of Algorithm 1 only lies in solving the subproblems for Z and C. As for updating Z, we need to calculate the FFT and inverse FFT of an N × N × n3
tensor along mode-3 and d n32+1 e SVDs of N ×N matrices in the Fourier domain,
245
which requires O(2N 2 n3 log(n3 ) + 12 N 3 n3 ) operations in each iteration. As for
the C subproblem, since the inverse matrix in (25) can be calculated in advance, the computation at each iteration will take O(4N 2 n3 log(n3 )). Next, the periteration cost of spectral clustering is O(N 3 ). It therefore follows that the total complexity of Algorithm 2 is O(T1 T2 (6N 2 n3 log(n3 ) + 21 N 3 n3 ) + T2 N 3 ), where 250
T1 is the number of iterations in solving (14) with ADMM and T2 is the number of outer iterations in Algorithm 2.
5. Kernel Clustering-Aware Laplacian Regularized Low-Rank Submodule Clustering The union of free submodules model only considers the t-linear relation255
ship within its individual components, but real-world data in many applications tend to be highly nonlinear. Consider a collection of N images X = {Xj ∈
Rn1 ×n3 }N j=1 , that are drawn from a mixture of nonlinear manifolds. One approach to deal with nonlinear manifolds is to use kernel methods [10, 31, 32, 34], which map the vectorized data {xj ∈ RM }N j=1 , M = n1 n3 , to a high-dimensional 19
260
feature space using the kernel trick [34]. However, such approaches can result in poor clustering performance because they neglect the spatial information of images. In order to address this problem, we first map each column of an image from Rn1 to a high-dimensional feature space F ⊂ Rne1 with n e1 ≫ n1 through a nonlinear map φ : Rn1 → F , and we denote the collection of φ-mapped “im-
265
e = {X e j ’s as lateral slices of e j ∈ Rne1 ×n3 }N . We then again arrange X ages” as X j=1
a tensor Xe ∈ Rne1 ×N ×n3 with ∀j, k, Xe(:, j, k) = φ(X (:, j, k)), and assume these
φ-mapped images lie near a union of L free submodules {`Snne31 }L `=1 of dimensions n e1 {s` < n e 1 }L `=1 . Our goal then is to find these submodules in Kn3 and segment
images into their respective clusters.
In order to exploit the local manifold structure of the data Xe, we assume if
two φ-mapped images Xe(:, i, :) and Xe(:, j, :) are close on a manifold, then their new representations, given by Z(:, i, :) and Z(:, j, :), should be also close. The distance between Xe(:, i, :) and Xe(:, j, :) can be computed as ξ(Xe(:, i, :), Xe(:, j, :)) = kXe(:, i, :) − Xe(:, j, :)k2F = =
=
n3 X
k=1 n3 X k=1
n3 X
k=1
kφ(X (:, i, k)) − φ(X (:, j, k))k22
kφ(X (:, i, k))k22 − 2φ(X (:, i, k))T φ(X (:, j, k)) + kφ(X (:, j, k))k22
κ(X (:, i, k), X (:, i, k)) − 2κ(X (:, i, k), X (:, j, k)) + κ(X (:, j, k), X (:, j, k)) ,
where κ is a positive semidefinite function κ : Rn1 × Rn1 → R that satisfies
κ(x, y) = hφ(x), φ(y)i for all x, y ∈ Rn1 . In this paper, we focus on the use of Gaussian kernels κ(x, y) = exp(−kx − yk22 /c2 ), where c > 0 is the parameter
of the kernel function. We again build an adjacency graph G with the i-th vertex corresponding to Xe(:, i, :) and let Nα (Xe(:, i, :)) denote the set of α nearest
neighbors of Xe(:, i, :) with respect to ξ. The similarity matrix Γ of the graph
G is now set as γi,j = 1, if Xe(:, i, :) ∈ Nα (Xe(:, j, :)) or Xe(:, j, :) ∈ Nα (Xe(:, i, :)),
and γi,j = 0 otherwise. Finally, the Laplacian matrix of the graph G is defined P as Ψ = ∆ − Γ, where ∆ is a diagonal matrix with δi,i = j γi,j . We propose
our kernel clustering-aware Laplacian regularized low-rank submodule clustering
20
(KCLLRSmC) obtained by solving the following variant of (13): min kZk~ + λ1 tr(bvec(Z)Ψbvec(Z)T ) + λ2 Z,F
s.t. 270
n3 X
k=1
F ∈ P,
kΘ Z(k) Z(k) k1 + λ3 kXe − Xe ∗ Zk2F (29)
where λ1 , λ2 and λ3 are penalty parameters. In practice, however, solving (29) directly is likely to be computationally intractable due to the extremely high dimensionality of the feature space. Instead, we are interested in solving (29) using the “kernel trick” [34], which only requires evaluations of inner products in F . Based on our discussion in Section 4.2, the (k) (k) (k)T b b b e e , k = 1, . . . , n3 , where X e X solution of (29) only requires knowledge of X b b denotes the k-th frontal slice of a tensor Xe with Xe = fft(Xe, [ ], 3). The challenge (k) (k)T b b e e X using the kernel trick. then is developing a method that computes X b = fft(H, [ ], 3). Notice that To resolve this, we first define H = XeT ∗ Xe and H
for any k = 1, . . . , n3 , we have the following T
T
b (k) b (k) e b (k) = X e X , H
(30)
(k) (k) b b e e X can be replaced by the k-th frontal slice of which suggests that X
b Now instead of using Xe explicitly for computations, the Fourier tensor H. it suffices to use H to solve (29). Let A ∈ Rn1 ×n2 be a matrix, we define
φ(A) = [φ(a1 ), φ(a2 ), . . . , φ(an2 )] ∈ Rne1 ×n2 to be a matrix whose columns
e (k) as are φ-mapped “images” of the columns in A. We can write each X
e (k) = φ(X(k) ), k = 1, . . . , n3 . Based on the definition of the t-product in X (2), H can also be written as H = bvfold(bcirc(XeT )bvec(Xe)), where φ(X(1) )T φ(X(2) )T . . . φ(X(n3 ) )T (n3 ) T (1) T (n3 −1) T φ(X ) φ(X ) . . . φ(X ) bcirc(XeT ) = .. .. .. .. . . . . φ(X(2) )T
φ(X(3) )T
...
φ(X(1) )T
and bvec(Xe) = [φ(X(1) ); φ(X(2) ); . . . ; φ(X(n3 ) )]. It then follows that H can 0
0
be obtained through the computations of Υk↔k = φ(X(k) )T φ(X(k ) ) with its 21
0
(k0 )
(k)
k↔k individual entries υi,j = κ(xi , xj
275
) for a pre-specified kernel κ, k, k 0 =
b by performing Fourier 1, . . . , n3 . Once H is available, we can readily compute H
b = fft(H, [ ], 3). Effectively, KCLLRSmC transform along mode-3 of H, i.e., H
b (k) in lieu of also relies on Algorithm 2, with the difference being that we use H
b (k)T X b (k) in (25). Finally, it is worth noting that CLLRSmC can be regarded X as a special case of KCLLRSmC with κ(x, y) = hx, yi for all x, y ∈ Rn1 . 6. Experimental Results
In this section, we evaluate the effectiveness of our proposed methods on
280
five datasets: UCSD background subtraction1 , MNIST2 , USPS [37], ORL [38], and Weizmann face dataset3 . We compare our proposed approaches with the following state-of-the-art clustering methods, including LRR [1], SSC [2], LSR [9], SC-LRR [12], TSC [14], S3C [18], KSSC [10], SSmC [25], and SCLRSmC 285
[26]. For the compared methods, we use the codes provided by the authors. For S3C, we fix α = 1 for all the experiments. The tuning parameter in TSC is set q = max(3, dN/(L × 20)e). All other parameters in each of these methods are tuned to achieve the best clustering performance. In order to study the effectiveness of integrating spectral clustering into the optimization problem, we
290
also report the results of CLLRSmC and KCLLRSmC by setting λ2 = 0 in (13) and (29), respectively, and the resulting methods are denoted by CLLRSmC∗ and KCLLRSmC∗ , respectively. For our methods, we fix the number of nearest neighbors to be α = 5. We first tune λ1 and λ3 to achieve the best performance for CLLRSmC∗ and KCLLRSmC∗ , and keep the settings of λ1 and λ3
295
in CLLRSmC and KCLLRSmC the same as in CLLRSmC∗ and KCLLRSmC∗ , respectively. We then iterate CLLRSmC and KCLLRSmC with a maximum of Tmax = 11 iterations. Unless otherwise specified, the default value for λ2 is set to 1 http://www.svcl.ucsd.edu/projects/background_subtraction/ucsdbgsub_dataset.
htm 2 http://yann.lecun.com/exdb/mnist/ 3 http://www.wisdom.weizmann.ac.il/
~/vision/FaceBase/
22
be λ2 = 0.1 in CLLRSmC and λ2 = 10 in KCLLRSmC. In the experiments, we use the clustering accuracy (ACC) and normalized mutual information (NMI) 300
[39] as the evaluation metrics. 6.1. Experiments on UCSD Background Subtraction Dataset
Figure 3: Sample images from the UCSD background subtraction dataset.
The UCSD background subtraction dataset contains 18 video sequences with ground-truth labels. This dataset is challenging due to large variations in the environmental settings including camera motion, cluttered background and es305
pecially highly dynamic backgrounds such as smoke, fire, and water waves. We select 10 sequences from the dataset, namely birds, boats, bottle, chopper, cyclists, flock, hockey, landing, ocean, and skiing. Each category contains 30 to 246 images. Representative frames from some of the sequences are shown in Fig. 3. Our goal is to segment the images into different clusters such that the images
310
in each cluster have the same background. We downsample all the frames into 90 × 135 pixels. In this experiment, we select L ∈ {3, 4, 5, 6, 7} sequences from the 10 categories. We repeat the experiment 20 times for each L. The parameters of our methods are tuned based on the clustering performance for L = 3 and the setting of parameters are used for L ∈ {4, 5, 6, 7}. The averaged clustering
315
results are shown in Table 5, from which we observe that although KCLLRSmC performs the second best for L = 3, it performs better than all other approaches by a large margin for L ∈ {4, 5, 6, 7}. Both CLLRSmC and KCLLRSmC per-
form slightly better than CLLRSmC∗ and KCLLRSmC∗ , respectively. In addition, KCLLRSmC has much better performance than CLLRSmC because of its
320
capability of capturing nonlinear structures within the data.
23
6.2. Handwritten Digits Clustering We evaluate our proposed methods on two widely used handwritten digits datasets: MNIST [40] and USPS [37]. The MNIST dataset comprises 70,000 centered 28 × 28 images of handwritten digits 0 - 9 (see sample images in Fig. 4(a)). 325
We select the first 100 images for each digit, resulting in a subset of 1,000 images. In this experiment, we select L ∈ {2, 3, 5, 8, 10} digits from the 10 classes. The random selection of the digits for each value of L ∈ {2, 3, 5, 8} is repeated 20 times. In order to test the robustness of our proposed methods against spatial shifts, we randomly shift the digits horizontally with respect to the center by 3
330
pixels either side (left or right). We tune the parameters of our methods based on the clustering performance for L = 10. Then we use these parameters for all L ∈ {2, 3, 5, 8} on this dataset. The clustering results for all the methods are reported in Table 6. From the results, we make the following conclusions: (i) both CLLRSmC∗ and KCLLRSmC∗ outperform all other methods in all cases;
335
(ii) both the CLLRSmC and KCLLRSmC consistently outperform CLLRSmC∗ and KCLLRSmC∗ for all L’s, respectively; and (iii) submodule clustering methods achieve significant improvement compared with all state-of-the-art subspace clustering methods, which can be attributed to the effectiveness of tensor based methods for clustering of unaligned imaging data.
(a)
(b)
Figure 4: Sample images from the (a) MNIST and (b) USPS datasets.
340
We also examine the running time of all the methods on MNIST dataset with 24
L = 3. All experiments are carried out using Matlab R2018a on an Intel i52540M 2.6GHz CPU with 6 GB RAM. Note that CLLRSmC∗ and KCLLRSmC∗ only perform the first iteration of CLLRSmC and KCLLRSmC, respectively. As Table 1 shows, submodule clustering methods tend to take more time compared 345
with subspace clustering methods. KCLLRSmC∗ takes slightly more time compared to CLLRSmC∗ because KCLLRSmC∗ needs to compute the tensor H. However, the advantage of KCLLRSmC∗ (resp., KCLLRSmC) over CLLRSmC∗
(resp., CLLRSmC) in clustering performance significantly outweighs this slight increase in computational complexity. Table 1: Running time comparison (in sec) for different methods on MNIST dataset.
LRR
SSC
LSR1
LSR2
SC-LRR
TSC
S3C
9.57
1.91
0.10
0.08
14.81
0.32
96.36
KSSC
SSmC
SCLRSmC
CLLRSmC∗
CLLRSmC
KCLLRSmC∗
KCLLRSmC
0.60
20.17
42.31
22.13
83.95
27.92
106.84
Table 2: Clustering performance on USPS dataset. The best results are in bold font. We set λ1 = 500, λ2 = 0.1, and λ3 = 0.008 for CLLRSmC. The parameters for KCLLRSmC are λ1 = 800, λ2 = 10, and λ3 = 0.2, and c = 4.
350
LRR
SSC
LSR1
LSR2
SC-LRR
TSC
S3C
ACC
0.3700
0.3640
0.4510
0.4580
0.5090
0.4270
0.2390
NMI
0.4555
0.4309
0.4566
0.4541
0.5352
0.5156
0.2565
KSSC
SSmC
SCLRSmC
CLLRSmC∗
CLLRSmC
KCLLRSmC∗
KCLLRSmC
ACC
0.5210
0.5100
0.6090
0.7850
0.7950
0.6830
0.7020
NMI
0.4279
0.4909
0.5560
0.6604
0.6749
0.6381
0.6494
The USPS dataset [37] consists of 9298 images of 10 handwritten digits. Fig. 4(b) shows some sample images, and each image has 16 × 16 pixels. Similar to the experiment on MNIST, we use the first 100 images of each digit for experiments; hence N = 1000. We again add a random (left or right) 3-pixel
25
horizontal shift to each image. Experimental results are summarized in Ta355
ble 2. We observe that CLLRSmC yields noticeable improvement in accuracy compared to other algorithms. 6.3. Face Clustering
(a)
(b)
Figure 5: Sample images from the (a) ORL and (b) Weizmann datasets. Table 3: Clustering performance on ORL dataset. The best results are in bold font. We set λ1 = 150, λ2 = 0.1, and λ3 = 0.4 for CLLRSmC. The parameters for KCLLRSmC are λ1 = 100, λ2 = 10, λ3 = 0.35, and c = 1.5.
LRR
SSC
LSR1
LSR2
SC-LRR
TSC
S3C
ACC
0.7800
0.8525
0.7475
0.7725
0.8050
0.5250
0.7825
NMI
0.8782
0.9224
0.8721
0.8856
0.9086
0.7742
0.9243
KSSC
SSmC
SCLRSmC
CLLRSmC∗
CLLRSmC
KCLLRSmC∗
KCLLRSmC
ACC
0.7600
0.7975
0.8525
0.8625
0.8825
0.8850
0.8925
NMI
0.8613
0.8862
0.9198
0.9272
0.9369
0.9380
0.9423
We present results of different clustering methods on ORL and Weizmann face datasets. The ORL face dataset [38] contains 40 distinct subjects and 360
each subject has 10 different images (see Fig. 5(a) for sample images from this dataset). The images are taken at different times with varying lighting conditions, poses, and facial expressions. In the experiment, we use all the 400 images and resize each image into 32 × 32. Table 3 presents the clustering re-
sults for all the methods. As can be seen from this table, both the CLLRSmC∗
26
365
and KCLLRSmC∗ outperform all other algorithms in both ACC and NMI, and CLLRSmC and KCLLRSmC further reduce the clustering error significantly. The Weizmann face database contains grayscale images of 27 different individuals with variations of pose, illumination and expression. Due to variations in pose, it can be argued that images of each subject approximately lie near
370
a submodule. We select a subset of this dataset composed of 4 subjects, with N = 120. Fig. 5(b) shows examples of images of the selected subjects. The original resolution of the images is 512 × 352. We downsample the images by a factor of 4 along each axis and crop them into 120 × 80 pixels. The clustering results are listed in Table 4. We can see that CLLRSmC again outperforms
375
all other methods. Note that the clustering accuracy of CLLRSmC can be further increased to 0.9000 with λ1 = 0.001, λ2 = 0.3, and λ3 = 0.001, while the accuracy of CLLRSmC∗ with the same λ1 and λ3 is only 0.7583. Table 4: Clustering performance on Weizmann dataset. The best results are in bold font. We set λ1 = 0.001, λ2 = 0.1, and λ3 = 0.005 for CLLRSmC. The parameters for KCLLRSmC are λ1 = 0.01, λ2 = 10, and λ3 = 0.5, and c = 8.
LRR
SSC
LSR1
LSR2
SC-LRR
TSC
S3C
ACC
0.6667
0.6833
0.6750
0.6750
0.6083
0.5000
0.5000
NMI
0.5674
0.5802
0.6039
0.5773
0.5153
0.3345
0.5170
KSSC
SSmC
SCLRSmC
CLLRSmC∗
CLLRSmC
KCLLRSmC∗
KCLLRSmC
ACC
0.5750
0.8083
0.7500
0.8500
0.8750
0.8167
0.8167
NMI
0.4317
0.7600
0.7632
0.7801
0.8287
0.8335
0.8335
In order to study the robustness of our methods, we add white Gaussian noise to the images. We use X noise to denote the noisy samples, such that every sample 380
X noise (:, i, :) can be expressed as X noise (:, i, :) = bvfold(bvec(X (:, i, :)) + ϕi ),
where ϕi ∈ RM is additive white Gaussian noise with zero mean and variance
σ 2 kX (:, i, :)k2F . In the experiment, σ 2 ranges from 0.05 to 0.2 and for each σ 2 , we
generate X noise (:, i, :) 10 times. Fig. 6 shows the averaged clustering results for
the best 9 methods. We can again infer that CLLRSmC achieves the highest 27
0.85
ACC
0.81
SSC LSR1 LSR2 SSmC SCLRSmC
0.8
CLLRSmC CLLRSmC KCLLRSmC KCLLRSmC
SSC LSR1 LSR2 SSmC SCLRSmC
0.76 0.72
NMI
0.89
0.77 0.73
CLLRSmC CLLRSmC KCLLRSmC KCLLRSmC
0.68 0.64
0.69
0.6
0.65 0.05
0.1
0.15
noise level (
2
0.2
0.56 0.05
0.1
)
0.15
noise level (
(a)
2
0.2
)
(b)
Figure 6: Clustering performance on the noisy Weizmann dataset. 80
80
60
60
40
40
20
20
0
0
1 2 3 4 5 6 7 8 9 10 11 number of iterations to convergence
1 2 3 4 5 6 7 8 9 10 11 number of iterations to convergence
(a)
(b)
Figure 7: Convergence results on the USPS dataset.
385
clustering accuracy and the clustering performance improves drastically with the help of the segmentation matrix F for both CLLRSmC and KCLLRSmC. 6.4. Convergence Analysis In this subsection, we show the convergence of our proposed methods empirically. Specifically, we randomly sample L ∈ {2, 3, 5, 8} digits from the USPS
390
dataset. For each L, we randomly sample 20 times and show the histogram of the numbers of iterations for our methods to converge with λ2 ∈ {0.1, 1, 10}. In Figure 7, we can observe that CLLRSmC converges in 2 ∼ 8 iterations and KCLLRSmC converges in 2 ∼ 5 iterations on average. 6.5. Parameter Sensitivity
395
There are three regularized parameters that can affect the performance of CLLRSmC and KCLLRSmC. In the following, we conduct experiments on ORL 28
dataset to show the effect of the parameters λ1 and λ3 on the performance of CLLRSmC and KCLLRSmC. We again fix λ2 = 0.1 for CLLRSmC and λ2 = 10 for KCLLRSmC. The ACC and NMI results with different combinations of λ1 400
and λ3 are given in Fig. 8(a) and Fig. 8(b), respectively. It is evident that competitive performance can be obtained over a wide range of parameters. In Fig. 8(c) and Fig. 8(d), we show the performance of KCLLRSmC with different combinations of λ1 and λ3 . From the results, we can again conclude that KCLLRSmC has promising results that are insensitive to various combinations
405
of λ1 and λ3 . 0.88
0.935 0.87 1
1
0.86
0.6
0.85
0.4 0.2 0 50
100
150
200
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.84
0.930
0.8
0.925
0.6 NMI
ACC
0.8
0.4 0.2 0 50
100
3
1
150
200
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.920
3
1
(a) ACC of CLLRSmC with different values (b) NMI of CLLRSmC with different values of λ1 and λ3 .
of λ1 and λ3 . 0.944
0.89
1
0.8
0.6
0.87
0.4 0.2 0 50
100
150
200
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.86
0.6 NMI
ACC
0.940
1
0.88
0.8
0.936
0.4 0.2 0 50
100
150
3
1
200
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.932
3
1
(c) ACC of KCLLRSmC with different values (d) NMI of KCLLRSmC with different values of λ1 and λ3 .
of λ1 and λ3 .
Figure 8: Parameter tuning with respect to λ1 and λ3 on the ORL dataset.
We conclude by noting that the choice of kernels in these experiments is agnostic to the data. Nonetheless, selection of the optimal kernel function is an active area of research in kernel-based methods, which is sometimes studied under the moniker of multiple kernel learning [33, 41, 42]. While some of these 410
works can be leveraged to further improve the performance of our proposed algorithms, a careful investigation of this is beyond the scope of this work. 29
7. Conclusion In this paper, we proposed a joint optimization approach for submodule clustering, termed clustering-aware Laplacian regularized low-rank submodule clus415
tering (CLLRSmC), for clustering of imaging data. The proposed CLLRSmC approach explicitly considers the manifold structures of the data and combines the two separate stages of computing the representation tensor and applying spectral clustering into a unified framework. We developed an efficient iterative algorithm that seeks the optimal solution of the joint optimization problem
420
based on ADMM and spectral clustering. Moreover, we introduced a kernel extension of CLLRSmC, called KCLLRSmC, such that one can deal with nonlinear manifolds. Extensive experimental results on image clustering showed the efficacy of the proposed methods and their superiority over the state-of-the-art alternatives. In addition, the performance of CLLRSmC on UCSD background
425
subtraction and ORL datasets can be significantly improved when we use its non-linear counterpart, KCLLRSmC, for clustering.
8. Acknowledgements The author would like to thank the anonymous reviewers for their valuable comments and suggestions.
430
References [1] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 171–184. [2] E. Elhamifar, R. Vidal, Sparse subspace clustering: Algorithm, theory,
435
and applications, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2765–2781.
30
[3] R. Vidal, Y. Ma, S. Sastry, Generalized principal component analysis (GPCA), IEEE Trans. Pattern Anal. Mach. Intell. 27 (12) (2005) 1945– 1959. 440
[4] P. S. Bradley, O. L. Mangasarian, k-plane clustering, J. Global Optim. 16 (1) (2000) 23–32. [5] T. Zhang, A. Szlam, Y. Wang, G. Lerman, Hybrid linear modeling via local best-fit flats, Int. J. Comput. Vis. 100 (3) (2012) 217–240. [6] M. A. Fischler, R. C. Bolles, Random sample consensus: A paradigm for
445
model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (6) (1981) 381–395. [7] A. Leonardis, H. Bischof, J. Maver, Multiple eigenspaces, Pattern Recognit. 35 (11) (2002) 392–403. [8] A. Gruber, Y. Weiss, Multibody factorization with uncertainty and missing
450
data using the EM algorithm, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 707–714. [9] C. Lu, H. Min, Z. Zhao, L. Zhu, D. Huang, S. Yan, Robust and efficient subspace segmentation via least squares regression, in: Proceedings of the European Conference on Computer Vision, 2012, pp. 347–360.
455
[10] V. M. Patel, R. Vidal, Kernel sparse subspace clustering, in: Proceedings of the IEEE International Conference on Image Processing, 2014, pp. 2849– 2853. [11] J. Liu, Y. Chen, J. Zhang, Z. Xu, Enhancing low-rank subspace clustering by manifold regularization, IEEE Trans. Image Process. 23 (9) (2014) 4022–
460
4030. [12] K. Tang, R. Liu, Z. Su, J. Zhang, Structure-constrained low-rank representation, IEEE Trans. Neural Netw. Learn. Syst. 25 (12) (2014) 2167–2179.
31
[13] R. Vidal, P. Favaro, Low rank subspace clustering (LRSC), Pattern Recognit. Lett. 43 (2014) 47–61. 465
[14] R. Heckel, H. B¨ olcskei, Robust subspace clustering via thresholding, IEEE Trans. Inf. Theory 61 (11) (2015) 6320–6342. [15] T. Wu, P. Gurram, R. M. Rao, W. U. Bajwa, Hierarchical union-ofsubspaces model for human activity summarization, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015,
470
pp. 1053–1061. [16] M. Yin, J. Gao, Z. Lin, Laplacian regularized low-rank representation and its applications, IEEE Trans. Pattern Anal. Mach. Intell. 38 (3) (2016) 504–517. [17] S. Xiao, M. Tan, D. Xu, Z. Y. Dong, Robust kernel low-rank representation,
475
IEEE Trans. Neural Netw. Learn. Syst. 27 (11) (2016) 2268–2281. [18] C.-G. Li, C. You, R. Vidal, Structured sparse subspace clustering: A joint affinity learning and subspace clustering framework, IEEE Trans. Image Process. 26 (6) (2017) 2988–3001. [19] Q. Li, W. Liu, L. Li, Affinity learning via a diffusion process for subspace
480
clustering, Pattern Recognit. 84 (2018) 39–50. [20] K. Braman, Third-order tensors as linear operators on a space of matrices, Linear Algebra Appl. 433 (7) (2010) 1241–1253. [21] M. E. Kilmer, C. D. Martin, Factorization strategies for third-order tensors, Linear Algebra Appl. 435 (3) (2011) 641–658.
485
[22] M. E. Kilmer, K. Braman, N. Hao, R. C. Hoover, Third-order tensors as operators on matrices: A theoretical and computational framework with applications in imaging, SIAM J. Matrix Anal. Appl. 34 (1) (2013) 148– 172.
32
[23] S. Velasco-Forero, J. Angulo, Classification of hyperspectral images by ten490
sor modeling and additive morphological decomposition, Pattern Recognit. 46 (2) (2013) 566–577. [24] Z. Zhang, G. Ely, S. Aeron, N. Hao, M. Kilmer, Novel methods for multilinear data completion and de-noising based on tensor-SVD, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
495
2014, pp. 3842–3849. [25] E. Kernfeld, S. Aeron, M. Kilmer, Clustering multi-way data: A novel algebraic approach, arXiv:1412.7056v2. [26] T. Wu, W. U. Bajwa, A low tensor-rank representation approach for clustering of imaging data, IEEE Signal Process. Lett. 25 (8) (2018) 1196–1200.
500
[27] C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, S. Yan, Tensor robust principal component analysis with a new tensor nuclear norm, IEEE Trans. Pattern Anal. Mach. Intell. [28] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
505
[29] X. He, D. Cai, S. Yan, H.-J. Zhang, Neighborhood preserving embedding, in: Proceedings of the IEEE International Conference on Computer Vision, 2005, pp. 1208–1213. [30] D. Cai, X. He, J. Han, T. S. Huang, Graph regularized nonnegative matrix factorization for data representation, IEEE Trans. Pattern Anal. Mach.
510
Intell. 33 (8) (2011) 1548–1560. [31] H. V. Nguyen, V. M. Patel, N. M. Nasrabadi, R. Chellappa, Design of non-linear kernel dictionaries for object recognition, IEEE Trans. Image Process. 22 (12) (2013) 5123–5135. [32] T. Wu, W. U. Bajwa, Learning the nonlinear geometry of high-dimensional
515
data: Models and algorithms, IEEE Trans. Signal Process. 63 (23) (2015) 6229–6244. 33
[33] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, M. I. Jordan, Learning the kernel matrix with semidefinite programming, J. Mach. Learn. Res. 5 (2004) 27–72. 520
[34] B. Sch¨ olkopf, A. J. Smola, K.-R. M¨ uller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319. [35] A. Y. Ng, M. I. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, in: Proceedings of the Neural Information Processing Systems, 2001, pp. 849–856.
525
[36] Z. Lin, R. Liu, Z. Su, Linearized alternating direction method with adaptive penalty for low-rank representation, in: Proceedings of the Neural Information Processing Systems, 2011, pp. 612–620. [37] J. J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (5) (1994) 550–554.
530
[38] F. S. Samaria, A. C. Harter, Parameterisation of a stochastic model for human face identification, in: Proceedings of the IEEE Workshop on Applications of Computer Vision, 1994, pp. 138–142. [39] N. X. Vinh, J. Epps, J. Bailey, Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for
535
chance, J. Mach. Learn. Res. 11 (2010) 2837–2854. [40] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324. [41] M. Kloft, U. Brefeld, S. Sonnenburg, A. Zien, `p -norm multiple kernel learning, J. Mach. Learn. Res. 12 (2011) 953–997.
540
[42] J. Zhuang, J. Wang, S. C. H. Hoi, X. Lan, Unsupervised multiple kernel learning, in: Proceedings of the Asian Conference on Machine Learning, Vol. 20, 2011, pp. 129–144.
34
35
0.9567
0.9542
0.7751
0.8324
0.7522
0.9975
0.8778
0.8784
SC-LRR
TSC
S3C
KSSC
SSmC
SCLRSmC
CLLRSmC∗
CLLRSmC
KCLLRSmC
0.9761
0.8191
LSR2
KCLLRSmC
0.8190
LSR1
0.9722
0.9607
SSC
∗
0.9627
ACC
LRR
No. clusters (L)
3
0.9833
0.9745
0.8262
0.8259
0.9937
0.7165
0.8248
0.7521
0.9341
0.9395
0.7991
0.8073
0.9556
0.9563
NMI
0.9734
0.9734
0.8655
0.8580
0.9293
0.7086
0.8346
0.7920
0.7959
0.7821
0.8000
0.8063
0.8366
0.8910
ACC
4
0.9800
0.9800
0.8664
0.8600
0.9441
0.7552
0.8431
0.8635
0.8554
0.8300
0.8067
0.8113
0.8861
0.9060
NMI
0.9286
0.9286
0.8186
0.8158
0.8772
0.7037
0.7740
0.8485
0.7660
0.6873
0.7047
0.7501
0.7863
0.8569
ACC
5
0.9522
0.9522
0.8570
0.8528
0.9140
0.7700
0.8317
0.9167
0.8494
0.7828
0.7347
0.7791
0.8633
0.8964
NMI
0.9316
0.9274
0.7856
0.7728
0.9105
0.6480
0.7914
0.8643
0.8234
0.6795
0.7185
0.7365
0.7824
0.8293
ACC
λ3 = 0.01 for CLLRSmC. The parameters for KCLLRSmC are λ1 = 0.001, λ2 = 10, and λ3 = 0.001, and c = 1.
6
0.9605
0.9576
0.8390
0.8281
0.9415
0.7529
0.8703
0.9298
0.8999
0.8035
0.7884
0.8038
0.8818
0.8887
NMI
0.9245
0.9034
0.7474
0.7454
0.8772
0.6379
0.7463
0.8106
0.7332
0.6315
0.6704
0.7099
0.7272
0.8025
ACC
7
0.9600
0.9477
0.8258
0.8231
0.9367
0.7480
0.8587
0.9103
0.8695
0.7814
0.7937
0.8193
0.8630
0.8813
NMI
Table 5: Clustering performance on UCSD background subtraction dataset. The best results are in bold font. We set λ1 = 0.1, λ2 = 0.1, and
36
0.5555
0.5662
0.5553
0.6137
0.8265
0.8763
0.9255
0.9277
SC-LRR
TSC
S3C
KSSC
SSmC
SCLRSmC
CLLRSmC∗
CLLRSmC
KCLLRSmC
0.9467
0.6275
LSR2
KCLLRSmC
0.6280
LSR1
0.9363
0.6043
SSC
∗
0.6165
ACC
LRR
No. digits (L)
2
0.7673
0.7337
0.7038
0.6970
0.5705
0.4350
0.1119
0.0245
0.0380
0.0229
0.1089
0.1097
0.0937
0.1186
NMI
0.7545
0.7372
0.6985
0.6923
0.6720
0.6373
0.5148
0.4742
0.4918
0.4923
0.4963
0.4890
0.5040
0.5150
ACC
3
0.5195
0.5235
0.4693
0.4717
0.4314
0.3338
0.2134
0.1944
0.2241
0.2245
0.2166
0.1974
0.2230
0.2370
NMI
0.6396
0.6295
0.6221
0.6179
0.5621
0.5148
0.4503
0.3929
0.4440
0.4262
0.3790
0.3969
0.3909
0.3903
ACC
The parameters for KCLLRSmC are λ1 = 20, λ2 = 10, and λ3 = 0.005, and c = 2.
5
0.5411
0.5292
0.5206
0.5172
0.4330
0.3655
0.3212
0.3222
0.3888
0.3620
0.2916
0.3037
0.3304
0.3302
NMI
0.5859
0.5793
0.5878
0.5848
0.5202
0.4676
0.3949
0.3321
0.4115
0.3869
0.3070
0.3186
0.3659
0.3412
ACC
8
0.5325
0.5167
0.5343
0.5308
0.4585
0.3961
0.3687
0.3444
0.4571
0.4095
0.2584
0.2643
0.3870
0.3817
NMI
0.5740
0.5720
0.5660
0.5560
0.4990
0.4300
0.3630
0.3390
0.4360
0.4160
0.3230
0.3250
0.3770
0.3160
ACC
10
0.5116
0.5084
0.5344
0.5194
0.4900
0.4036
0.3673
0.4198
0.4739
0.3998
0.2972
0.3074
0.3746
0.3434
NMI
Table 6: Clustering performance on MNIST dataset. The best results are in bold font. We set λ1 = 200, λ2 = 0.1, and λ3 = 0.001 for CLLRSmC.
Conflict of Interest Authors confirm that there are no known conflicts of interest associated with 545
this publication and there has been no significant financial support for this work that could have influenced its outcome. Tong Wu received BE degree in Instrument Science and Engineering from Shanghai Jiao Tong University, China in 2009, MS degree in Electrical Engineering from Duke University in 2011, and PhD degree in Electrical and Computer
550
Engineering from Rutgers University in 2017. His research interests include high-dimensional data analysis, statistical signal processing, image and video processing, and machine learning.
37