Robust sequential subspace clustering via ℓ1-norm temporal graph


Communicated by Prof. Liu Guangcan

Journal Pre-proof

Wenyu Hu, Shenghao Li, Weidong Zheng, Yao Lu, Gaohang Yu

PII: S0925-2312(19)31721-7
DOI: https://doi.org/10.1016/j.neucom.2019.12.019
Reference: NEUCOM 21647
To appear in: Neurocomputing
Received date: 12 June 2019
Revised date: 21 October 2019
Accepted date: 3 December 2019

Please cite this article as: Wenyu Hu, Shenghao Li, Weidong Zheng, Yao Lu, Gaohang Yu, Robust sequential subspace clustering via ℓ1-norm temporal graph, Neurocomputing (2019), doi: https://doi.org/10.1016/j.neucom.2019.12.019

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.

Highlights

• Construct an ℓ1-norm temporal graph to exploit the temporal information in sequential data, where the importance of neighboring frames is not treated equally.
• Our model is obtained by integrating classical SSC and the proposed ℓ1-norm temporal graph.
• Design an efficient algorithm based on the proximity operator, where each subproblem has a closed-form solution.
• Convergence is theoretically guaranteed, although the objective function is nonsmooth.
• Extensive experiments are conducted on both synthetic and real-world data sets.


Robust sequential subspace clustering via ℓ1-norm temporal graph

Wenyu Hu^a, Shenghao Li^a, Weidong Zheng^a, Yao Lu^b,∗, Gaohang Yu^c,∗

a School of Mathematics and Computer Sciences, Gannan Normal University, Ganzhou, P.R. China
b School of Data and Computer Science, Sun Yat-sen University, Guangzhou, P.R. China
c Department of Mathematics, School of Science, Hangzhou Dianzi University, Hangzhou, P.R. China

Abstract

Subspace clustering (SC) has been widely applied to segment data drawn from multiple subspaces. However, for sequential data, a main challenge in subspace clustering is to exploit the temporal information. In this paper, we propose a novel robust sequential subspace clustering approach with an ℓ1-norm temporal graph. The ℓ1-norm temporal graph is designed to encode the temporal information underlying sequential data. By using the ℓ2 norm, it enforces the temporal similarity of neighboring frames with a sample-dependent weight, and it mitigates the effect of noise and outliers on subspace clustering because large errors mixed into the real data can be suppressed. Under the assumption of data self-expression, our clustering model is put forward by further integrating the classical Sparse Subspace Clustering and the ℓ1-norm Temporal Graph (SSC-L1TG). To solve the proposed model, we introduce a new efficient proximity algorithm. At each iteration, the sub-problem is solved by proximal minimization with a closed-form solution. In contrast to the alternating direction method of multipliers (ADMM) employed in most existing clustering approaches without a convergence guarantee, the proposed SSC-L1TG is guaranteed to converge to the desired optimal solution. Experimental results on both synthetic and real data demonstrate the efficacy of our method and its superior performance over state-of-the-art methods.

∗ Corresponding authors.
E-mail addresses: [email protected] (W. Hu), [email protected] (S. Li), [email protected] (W. Zheng), [email protected] (Y. Lu), [email protected] (G. Yu).

Keywords: Sequential data, Sparse subspace clustering (SSC), Low rank representation (LRR), Proximal gradient, ℓ2,1 norm

1. Introduction

Subspace clustering (SC) has attracted a lot of attention recently and finds numerous applications in computer vision and image processing, e.g. motion segmentation [1], face image clustering [2] and video surveillance [3]. Given a set of points drawn from a union of linear subspaces, SC aims to partition the data into different clusters, with each cluster corresponding to one subspace. The key idea of SC is to learn and represent the underlying subspace structure. From the perspective of the mechanisms used to represent subspaces, existing methods for SC can be roughly divided into four categories, including iterative [4], algebraic [5, 6], statistical [7] and spectral-based methods [2, 8, 9]. An elaborate review of these methods can be found in [10]. In this paper, we focus only on the spectral-based methods, especially on two representative methods, Sparse Subspace Clustering (SSC) [2] and Low-Rank Representation (LRR) [8, 11], and their variants [9, 12, 13, 14].

Spectral-based SC methods begin by learning an affinity matrix whose entries measure the similarities among the data points, and then perform a spectral clustering algorithm such as K-means or normalized cuts (Ncut) [15] on the learned affinity matrix to segment the data. Obviously, constructing a proper affinity matrix plays a crucial role in the process of subspace clustering; methods in this class usually differ only in the way the affinity matrix is constructed. Ideally, a "good" affinity matrix should be block diagonal, where the between-cluster affinities are all zeros. A typical choice for the measure of similarity is the pairwise distance (e.g. Euclidean distance) between each pair of samples. However, such a measure is unable to characterize the underlying subspace structure of the data, and thus the resulting affinity matrix is not block diagonal. Inspired by sparse and low-rank signal recovery algorithms [16], SSC and LRR provide new ways to construct the affinity matrix. They treat the SC problem as one of finding a sparse or low-rank representation of the data in the dictionary of the data itself, and the representation matrix is used as the affinity matrix. When the subspaces are independent, both SSC and LRR possess the block diagonal property [2]. One problem is that, by finding the

sparsest representation of each data point individually, SSC can produce a representation matrix that is too sparse and is likely to cause over-segmentation. LRR aims to find the lowest-rank representation of all data points jointly and is therefore better at capturing global structure than SSC, but it requires sufficient sampling. Since both the sparse and low-rank optimization problems are non-convex and NP-hard, SSC and LRR minimize convex relaxations of the initial models by solving an ℓ1-norm minimization problem [2] or a nuclear-norm minimization problem [11]. By now, many methods have been proposed to improve the performance of SSC and LRR, such as ℓp-norm SSC [17], structure-constrained LRR [18], nonconvex LRR [13, 19], Least Squares Regression (LSR) [9, 20], Correlation Adaptive Subspace Segmentation (CASS) [12], the deeply Structured AutoEncoder (StructAE) [21] and the L2-graph method based on the property of intra-subspace projection dominance (IPD) [22]. Moreover, to handle corrupted data with a mixture structure of multiple subspaces, LRR was improved and applied to recover the mixture data by pursuing an appropriate dictionary [23].

These algorithms based on SSC and LRR have achieved great success in many areas, but few take the sequential properties into account when the data for clustering have an ordered or sequential structure [24]. In fact, in many clustering scenarios the data, e.g. video or other time-sequential data, spatially adjacent signals and continuous frequency spectrum data, often have an ordered or sequential structure. The common feature of sequential data is that nearby frames sharing similar features are likely to belong to the same subspace; thus it is very important for clustering to exploit the temporal relationship among neighboring frames. The spatial subspace clustering (SpatSC) method proposed by Guo et al. [25] is such an attempt, which extends SSC by incorporating an extra penalty term to model the temporal ordering of the data. The penalty term is based on the ℓ1 norm and aims to enforce similarity of neighboring data points in the affinity matrix. However, using the ℓ1 norm only imposes sparsity at the element level on the column differences of the affinity matrix and does not directly penalize whole-column similarity; in fact, it allows some values in consecutive columns to be greatly different. To overcome the disadvantage of enforcing similarity individually, ordered subspace clustering (OSC) was proposed in [26] by replacing the ℓ1 norm with the ℓ1,2 norm, which ensures that the learned coefficient matrix is column-sparse. However, OSC has some shortcomings as well. First, OSC only takes into account the global temporal information of sequential data,

ignoring the local structural information among data points. For example, if two adjacent frames are drawn from different subspaces, their representation vectors should be encouraged to be far from each other; on the contrary, if two adjacent frames are from the same subspace, their representation vectors should be encouraged to be as close as possible. Nevertheless, OSC always adopts the same strategy regardless of the category of the neighboring frames. Second, the optimization scheme (ADMM) used in [26] lacks a convergence guarantee and also suffers from a huge computational cost, because a Sylvester equation has to be solved iteratively in the optimization process, which is expensive for large-scale matrices. Given these deficiencies, some improvements were made in [27] through the adoption of the LADMAP (linearized alternating direction method with adaptive penalty) [28] and LADMPSAP (linearized alternating direction method with parallel splitting and adaptive penalty) [29] frameworks. However, new variables and parameters were introduced there, which increases the computational load and requires tuning more parameters, which is not an easy job. Besides, the convergence analysis is rather complex because it inherits the features of LADMAP and LADMPSAP. More recently, Li et al. [30] proposed a temporal subspace clustering (TSC) approach by incorporating a temporal Laplacian regularization. However, the quadratic form of graph embedding used in TSC makes the clustering results sensitive to data noise and outliers [31]. Therefore, it is necessary to devise a more flexible and scalable algorithm for sequential subspace clustering.

To address the above problems, this paper proposes a novel SSC plus ℓ1-norm temporal graph model (SSC-L1TG) to perform sequential subspace clustering. The ℓ1-norm temporal graph is applied to learn the temporal information of sequential data, where an adaptive distance weight between adjacent frames is used to preserve the local neighbor relationships of the data. Instead of using a squared ℓ2 norm for spectral embedding as in traditional methods like TSC, our model is regularized by an ℓ2-norm-based temporal term and is thus more robust to data noise and outliers. Moreover, we develop an efficient iterative algorithm to solve the clustering model by using proximal algorithms. We observe that the objective consists of three terms: one is smooth with a Lipschitz continuous gradient, and the other two are nonsmooth but convex. By employing the concepts of the proximity operator and the subdifferential [32, 33, 34], we set up a fixed-point equation system for the clustering model and design a proper iterative scheme based on the technique of matrix splitting. In our method, the subproblems to be solved have closed-

form solutions and, more importantly, the proof of convergence can be easily established. Extensive experiments on synthetic and real datasets demonstrate that SSC-L1TG is an effective method. In summary, the contributions of this paper are as follows: (1) A new temporal similarity measure, called the ℓ1-norm temporal graph, is proposed for sequential data clustering and is incorporated into the classical SSC model. (2) A simple and effective algorithm is proposed based on the proximity operator. We also theoretically establish the convergence of the proposed algorithm.

The rest of this paper is organized as follows. In Section 2, we briefly review low-rank and sparse representation based SC and sequential SC methods. In Section 3, we derive the proposed method SSC-L1TG. Extensive experiments are carried out in Section 4. Section 5 concludes this paper.

2. Related work

Before reviewing the related work, we define some notation. All matrices and vectors in this paper are represented by boldface uppercase and lowercase symbols, respectively. For a matrix M, M_ij is its (i, j)-th entry, m^i is its i-th row and m_i is its i-th column. For a vector v, v_i is its i-th entry. M^T and M^{-1} are the transpose and inverse of the matrix M, respectively, while v^T is the transpose of the vector v. I_d denotes the identity matrix of size d × d, and tr(·) is the trace of a matrix. The inner product of two matrices M and N is defined by ⟨M, N⟩ = tr(M^T N). In addition, ‖M‖_1, ‖M‖_{1,2}, ‖M‖_{2,1}, ‖M‖_F and ‖M‖_∗ denote the ℓ1 norm, ℓ1,2 norm, ℓ2,1 norm, Frobenius norm and nuclear norm of the matrix M, respectively, where ‖M‖_{1,2} = Σ_i ‖m_i‖_2 while ‖M‖_{2,1} = Σ_i ‖m^i‖_2. Let S_+ (S_+^n) be the set of all symmetric positive definite matrices (in R^{n×n}) with compatible sizes. For every D ∈ S_+, we denote ⟨M, N⟩_D := ⟨M, DN⟩.

2.1. Subspace clustering by SSC and LRR

Suppose that there is a set of data points X = [X_1, · · ·, X_c] = [x_1, · · ·, x_n] ∈ R^{d×n} drawn from a union of c linear subspaces {S_i}_{i=1}^c. Let X_i be a collection of n_i data vectors drawn from the subspace S_i, with n = Σ_{i=1}^c n_i. The task of subspace clustering is to segment the data according to the underlying subspaces from which the samples are drawn.

By making the hypothesis that each data point can be represented as a linear combination of other data points in the same subspace, Sparse Subspace Clustering (SSC) [2, 10] solves the following sparse representation problem:

    min_Z ‖X − XZ‖_F^2 + λ‖Z‖_0,  s.t. diag(Z) = 0,    (1)

where Z is the sparse representation coefficient matrix and λ is the weight balancing the two terms. Such an optimization problem is non-convex and NP-hard, so SSC usually uses the ℓ1 norm to approximate the ℓ0 norm. The obtained model is

    min_Z ‖X − XZ‖_F^2 + λ‖Z‖_1,  s.t. diag(Z) = 0.    (2)
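For illustration only, the following minimal Python/NumPy sketch approaches problem (2) with a plain proximal-gradient iteration: a gradient step on the fidelity term, entrywise soft-thresholding for the ℓ1 term, and projection onto the constraint diag(Z) = 0. This is not the solver used by SSC [2]; the step size, λ and the toy data are illustrative assumptions.

```python
import numpy as np

def ssc_proximal_gradient(X, lam=0.1, n_iter=500):
    """Rough proximal-gradient sketch for problem (2); not the ADMM solver of SSC [2]."""
    n = X.shape[1]
    Z = np.zeros((n, n))
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ Z - X)               # gradient of ||X - XZ||_F^2
        Z = Z - step * grad
        Z = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)  # prox of lam*||.||_1
        np.fill_diagonal(Z, 0.0)                   # enforce diag(Z) = 0
    return Z

rng = np.random.default_rng(0)
basis = rng.standard_normal((20, 3))
X = basis @ rng.standard_normal((3, 15))           # 15 points from one 3-dimensional subspace
Z = ssc_proximal_gradient(X)
print("max |diag(Z)| =", np.abs(np.diag(Z)).max())
```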

When the data are drawn from independent subspaces, it has been shown that the solution of problem (2) is block diagonal. However, as SSC finds the sparsest representation of each data vector individually, there is no global constraint on its solution, and therefore the correctness of SC is not guaranteed. On the other hand, low-rank representation (LRR) [11] aims to capture the global structure of the data by finding a low-rank representation instead of a sparse representation. The original problem of LRR is formulated as:

    min_Z rank(Z),  s.t. X = XZ.    (3)

Due to the discrete nature of the rank function, problem (3) is hard to solve. Therefore, LRR employs the nuclear norm as a surrogate of the rank function. Furthermore, to improve the robustness of LRR to noise and outliers, an error term ‖·‖_{1,2} is added to problem (3). The final LRR model becomes

    min_{Z,E} ‖E‖_{1,2} + λ‖Z‖_∗,  s.t. X = XZ + E.    (4)

In many cases, LRR can outperform SSC, but its computational complexity is higher because it requires the singular value decomposition (SVD) of Z at each iteration, and it cannot be separated and computed in parallel. Moreover, for problem (4), there is no theoretical analysis of the importance of the low-rank property of the representation matrix Z for subspace clustering. In many practical scenarios, its solution may be quite dense and far from block diagonal [9, 35].

2.2. Sequential subspace clustering

SSC and LRR enforce a sparse or low-rank constraint on the representation matrix, yet for time-series data they do not go further to explore and make use of the prior temporal information residing in the data. For sequential data clustering, the SpatSC method proposed by Guo et al. [25] extends the standard SSC method by incorporating an extra penalty term. Its model is

    min_Z (1/2)‖X − XZ‖_F^2 + λ1‖Z‖_1 + λ2‖ZR‖_1,  s.t. diag(Z) = 0,    (5)

where λ1 and λ2 are the balance weights, and R ∈ R^{n×(n−1)} is a lower triangular matrix introduced to enforce the consistency of the sparse representations of the sequential data:

    R = [ −1   0   0  ···   0
           1  −1   0  ···   0
           0   1  −1  ···   0
           ⋮   ⋮   ⋱   ⋱    ⋮
           0   0   0  ···  −1
           0   0   0  ···   1 ] ∈ R^{n×(n−1)}.    (6)

Let Z = [z_1, z_2, · · ·, z_n]. From the expression ZR = [z_2 − z_1, z_3 − z_2, · · ·, z_n − z_{n−1}], we can see that the goal of the third term ‖ZR‖_1 is to encourage consecutive columns of Z to be similar. However, it imposes similarity at the level of individual entries, and therefore fails to directly penalize whole-column similarity. To address this problem, the OSC method [26, 27] adopts the ℓ1,2 norm in the third term instead of the ℓ1 norm, and obtains the following model:

    min_Z (1/2)‖X − XZ‖_F^2 + λ1‖Z‖_1 + λ2‖ZR‖_{1,2},  s.t. diag(Z) = 0.    (7)
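The following small NumPy sketch (our illustration, not code from [25] or [26]) constructs R of Eq. (6) and contrasts the element-wise penalty ‖ZR‖_1 used by SpatSC with the column-wise penalty ‖ZR‖_{1,2} used by OSC; the toy matrix Z is arbitrary.

```python
import numpy as np

def temporal_difference_matrix(n):
    # R of Eq. (6): R[i, i] = -1, R[i+1, i] = 1, so column i of Z R equals z_{i+1} - z_i.
    R = np.zeros((n, n - 1))
    idx = np.arange(n - 1)
    R[idx, idx] = -1.0
    R[idx + 1, idx] = 1.0
    return R

def l1_penalty(M):          # ||M||_1: sum of absolute entries (SpatSC, Eq. (5))
    return np.abs(M).sum()

def l12_penalty(M):         # ||M||_{1,2}: sum of column-wise l2 norms (OSC, Eq. (7))
    return np.linalg.norm(M, axis=0).sum()

rng = np.random.default_rng(0)
n = 6
Z = rng.standard_normal((n, n))
R = temporal_difference_matrix(n)
ZR = Z @ R
# Column i of ZR is indeed the difference of consecutive columns of Z.
assert np.allclose(ZR[:, 0], Z[:, 1] - Z[:, 0])
print("||ZR||_1 =", l1_penalty(ZR), "  ||ZR||_{1,2} =", l12_penalty(ZR))
```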

The penalty term ‖ZR‖_{1,2} is stronger than ‖ZR‖_1 at enforcing column similarity of Z. However, OSC has two shortcomings. First, the ADMM method used to solve problem (7) in [26] is not guaranteed to converge, and it also suffers from a huge computational cost because a large Sylvester equation system has to be solved at each iteration. Although improved algorithms based on LADMAP and LADMPSAP were proposed in [27], their convergence analysis is quite complex and the computational costs are still expensive. Second, the term ‖ZR‖_{1,2} = Σ_{k=1}^{n−1} ‖z_{k+1} − z_k‖_2 is unable to yield a discriminative representation matrix Z. As we know, two adjacent frames drawn from the same subspace often exhibit a stronger correlation than two frames from different subspaces. But ‖ZR‖_{1,2} ignores this correlation within the temporal neighborhood, because it penalizes the distance between every two adjacent frames evenly.

Several improved models have been proposed to alleviate the drawbacks of OSC in capturing the temporal correlation of data. Ordered subspace clustering with block-diagonal priors (BD-OSC) [24] was proposed by imposing a block-diagonal prior on the affinity matrix Z. Instead of assuming data self-representation, Li et al. [30] presented a temporal subspace clustering (TSC) algorithm that learns a non-negative dictionary from the data with a temporal Laplacian regularization. The resulting model is

    min_{Z,D} ‖X − DZ‖_F^2 + λ1‖Z‖_F^2 + λ2 f(Z),  s.t. Z ≥ 0, D ≥ 0, ‖d_i‖_2^2 ≤ 1, i = 1, 2, · · ·, r,    (8)

where the non-negative constraints Z ≥ 0 and D ≥ 0 ensure that the learned bases and the corresponding representation codings have non-negative values, the constraint ‖d_i‖_2^2 ≤ 1 controls the model complexity, and f(Z) is the temporal Laplacian regularization function. Given a representation matrix Z, it is defined as

    f(Z) = (1/2) Σ_{i=1}^n Σ_{j=1}^n w_{ij} ‖z_i − z_j‖_2^2 = tr(Z L Z^T),    (9)

where L is a temporal Laplacian matrix, L = D − W, D is a diagonal matrix with D_ii = Σ_{j=1}^n w_{ij}, and W is a proper weight matrix for capturing the temporal relationships in X. The TSC method captures the temporal information of the data through this Laplacian regularization.

3. Sequential subspace clustering via ℓ1-norm temporal graph

In this section, we introduce the newly proposed method SSC-L1TG.

3.1. A new ℓ1-norm temporal graph

Suppose we have a set of sequential data X = [x_1, · · ·, x_n] ∈ R^{d×n} (the i-th column is sampled at time t_i) drawn from a union of c subspaces of unknown dimensions. In the last section, it was shown that incorporating the temporal information of sequential data improves the performance of clustering methods, as demonstrated by OSC and TSC, over their original models SSC and LRR. To promote the order relations reflected in the representation matrix, OSC and TSC introduce a constraint term to take care of it. Given a representation matrix Z, we consider the following weighted temporal smoothness constraint:

    f(Z) = Σ_{i=1}^{n−1} w_i ‖z_{i+1} − z_i‖_2 = ‖ZRW‖_{1,2},    (10)

where R is defined as in Eq. (6), and W is a diagonal matrix with W_ii = w_i, i = 1, 2, · · ·, n − 1. Denoting by y the (n − 1)-dimensional vector whose i-th element is w_i ‖z_{i+1} − z_i‖_2, we can rewrite the above term as f(Z) = ‖y‖_1, i.e., the ℓ1 norm of y. Here, we only consider temporally neighboring points in a graph; thus, as in [31], we term it the ℓ1-norm temporal graph. Our motivation is that similar neighboring points should have similar representation coefficients, so that they have a larger probability of belonging to the same subspace. In particular, if x_{i+1} and x_i are close, then their corresponding codings, z_{i+1} and z_i, should be close too. To this end, a proper weight matrix can be constructed by using the Euclidean distance,

    w_i = exp( −‖x_{i+1} − x_i‖_2^2 / t ),    (11)

where t is a tuning parameter, empirically set as the mean of ‖x_{i+1} − x_i‖_2^2, i = 1, 2, · · ·, n − 1. The weight w_i in Eq. (11) depends on the distance between neighboring sample points, but of course it can be defined in other ways, for instance, via the angles between the normalized data points used in [36, 37]. Therefore, minimizing f(Z) with our choice of the weights (11) incurs a heavy penalty if neighboring points x_{i+1} and x_i with a small Euclidean distance are encoded far apart. Furthermore, after obtaining the weights by Eq. (11), we resort to the technique of bilateral filtering (BF) [38] to smooth the weights while preserving sharp edges, where the intersection of two neighboring subspaces is probably located. Fig. 1 (a)(b) shows an example, where by BF the oscillating structures are well suppressed, while the most important spike-like edges are retained. Such features are what the ideal weights should have. Note that the locations of the needle-like points are approximately at the intersections of two distinct subspaces. Thus, the smoothed weights ensure that, by minimizing Eq. (10), neighboring points drawn from the same subspace have similar codings, while neighboring points from different subspaces are encoded far apart. A small sketch of the weight construction is given below.

Figure 1: The weights used for a sequential data set consisting of three different motion sequences (i.e., jack, run and walk) described in Section 4.4: (a) the rough weights computed by Eq. (11), (b) smoothed version of (a) by BF, (c) w_i ≡ 1 used by OSC. In (a), we can see the weight curve has two spike-like structures with very small values in the center, while it has oscillating structures with high values elsewhere. After applying BF, the curve in (b) becomes smoother in the flat areas, while its spike-like structures are well preserved.
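The following minimal NumPy sketch (our illustration, not the implementation used in our experiments) computes the adaptive weights of Eq. (11), applies a simple one-dimensional bilateral-filter-style smoothing as a stand-in for the BF step of [38], and evaluates the ℓ1-norm temporal graph term of Eq. (10); the window radius and the parameters sigma_s and sigma_r are illustrative assumptions.

```python
import numpy as np

def temporal_weights(X, t=None):
    """Adaptive weights w_i of Eq. (11) for the columns of X (one sample per column)."""
    d2 = np.sum(np.diff(X, axis=1) ** 2, axis=0)      # ||x_{i+1} - x_i||_2^2, i = 1..n-1
    if t is None:
        t = d2.mean()                                  # empirical choice of t from the paper
    return np.exp(-d2 / t)

def smooth_weights(w, sigma_s=2.0, sigma_r=0.1):
    """1-D bilateral-filter-style smoothing of the weight curve (a stand-in for BF [38])."""
    n = len(w)
    out = np.empty(n)
    for i in range(n):
        j = np.arange(max(0, i - 3), min(n, i + 4))    # small local window
        k = np.exp(-(j - i) ** 2 / (2 * sigma_s ** 2) - (w[j] - w[i]) ** 2 / (2 * sigma_r ** 2))
        out[i] = np.sum(k * w[j]) / np.sum(k)
    return out

def l1_temporal_graph(Z, w):
    """f(Z) of Eq. (10): sum_i w_i ||z_{i+1} - z_i||_2 = ||Z R W||_{1,2}."""
    diffs = np.diff(Z, axis=1)                         # columns z_{i+1} - z_i
    return np.sum(w * np.linalg.norm(diffs, axis=0))

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 50))                      # toy sequential data, 50 frames
Z = rng.standard_normal((50, 50))                      # a candidate representation matrix
w = smooth_weights(temporal_weights(X))
print("f(Z) =", l1_temporal_graph(Z, w))
```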

3.2. Our model

We also assume that each data point can be written as a sparse linear combination of all the other points. So, by combining SSC with the ℓ1-norm

temporal graph, our objective function is obtained as follows:

    min_Z (1/2)‖X − XZ‖_F^2 + λ1‖Z‖_1 + λ2‖ZRW‖_{1,2},    (12)

where λ1 and λ2 are trade-off parameters that balance the different terms. We name the above model the SSC method with an ℓ1-norm temporal graph (SSC-L1TG). After we find the data representation matrix Z by solving the optimization problem (12), it is used to define the affinity matrix of an undirected graph as (|Z^T| + |Z|)/2, and then spectral clustering algorithms such as Ncut [15] are employed to segment the data into clusters.

Our model is closely related to OSC and TSC, but there are significant differences in the temporal constraint term, which make our model more competitive. First, our model is more discriminative than OSC in learning temporal information. Note that OSC is a special case of our model obtained by taking w_i ≡ 1, i.e., W = I_{n−1}. OSC penalizes the differences z_{i+1} − z_i evenly, which ignores the similarity information between x_{i+1} and x_i. In contrast, our model adopts the weight (11), which includes an adaptive distance penalty. As explained in Fig. 1 and Section 3.1, it is able to enhance the discriminative power of the learned affinity matrix. Moreover, the adaptive weight W prevents SSC-L1TG from obtaining the trivial solution Z = I. The reason is that, for the proposed weight, there must exist some index i0 such that w_{i0} ≠ w_{i0+1}. Then, minimizing f(Z) in Eq. (10) makes ‖z_{i0+1} − z_{i0}‖_2 ≠ ‖z_{i0+2} − z_{i0+1}‖_2, and so Z ≠ I. Therefore, we remove the constraint diag(Z) = 0. Second, our model is more robust than TSC to data outliers. From the self-expression X = XZ, data with noise and outliers cause errors in the coding matrix Z. Compared with TSC, which uses the squared norm in Eq. (9), our model employs the unsquared ℓ2 norm in Eq. (10) and is thus able to weaken the effect of noise and outliers, since large errors are not exaggerated [39].

3.3. Optimization

In this section, we develop a proximal algorithm to solve the optimization problem (12). For simplicity of exposition, we set A := X^T, B := W^T R^T and Y := Z^T.


Then the optimization problem (12) can be equivalently written as

    min_Y (1/2)‖A − YA‖_F^2 + λ1‖Y‖_1 + λ2‖BY‖_{2,1}.    (13)

The objective function of this problem is convex and coercive, since the involved norms are convex and coercive. Hence, a solution of problem (13) exists and can be characterized in terms of the proximity operator. To this end, we review the definition of the proximity operator. Let E be a vector space. By Γ0(E) we denote the class of all lower semicontinuous convex functions ψ : E → (−∞, +∞] such that dom(ψ) := {x ∈ E : ψ(x) < +∞} ≠ ∅. For a function ψ ∈ Γ0(E), the proximity operator of ψ with respect to a given matrix D ∈ S_+ at x ∈ E is defined by

    prox_{ψ,D}(x) := argmin{ (1/2)‖u − x‖_D^2 + ψ(u) : u ∈ E }.    (14)

In particular, we write prox_ψ for prox_{ψ,I}. It is also easy to see that prox_{λ^{-1}ψ}(x) = prox_{ψ,λI}(x). The proximity operator of a function is closely related to its subdifferential. The subdifferential of ψ at x is the set defined by

    ∂ψ(x) := {y ∈ E : ψ(z) ≥ ψ(x) + ⟨y, z − x⟩, ∀z ∈ E}.    (15)

Next, we recall some key facts of convex analysis concerning the proximity operator and the subdifferential of a convex function, which serve as the main tools for our algorithmic development.

Lemma 3.1. Let h : E → (−∞, +∞] be a proper convex function. Then for any u, x ∈ E,
(i) the set ∂h(x) is closed and convex ([32] Theorem 3.9);
(ii) x∗ ∈ argmin{h(x) : x ∈ E} if and only if 0 ∈ ∂h(x∗) (Fermat rule, [32] Theorem 3.63);
(iii) x = prox_{h,D}(x + u) if and only if Du ∈ ∂h(x) ([33]);
(iv) given λ > 0, prox_{λh}(x) + λ prox_{λ^{-1}h∗}(x/λ) = x, where h∗ is the conjugate function of h defined at y ∈ E by h∗(y) := sup{⟨x, y⟩ − h(x) : x ∈ E} (Moreau identity, [32] Theorem 6.45);
(v) u ∈ ∂h(x) if and only if x ∈ ∂h∗(u) ([40] Proposition 16.9);
(vi) the subdifferential of h is a monotone operator, i.e., if g1 ∈ ∂h(x) and g2 ∈ ∂h(u), then ⟨x − u, g1 − g2⟩ ≥ 0 ([41] Lemma 3).

By setting E = R^{n×d}, the following theorem characterizes the solution of optimization problem (13); it is simply derived from Lemma 3.1.

Theorem 3.2. Let A ∈ R^{n×d}, λ1 > 0, λ2 > 0. If Y ∈ R^{n×n} is a solution of problem (13), then for any µ > 0 and γ > 0, there exists a matrix U ∈ R^{(n−1)×n} such that

    U = prox_{µ^{-1}(λ2‖·‖_{2,1})∗}( U + µ^{-1}BY ),    (16)
    Y = prox_{λ1γ^{-1}‖·‖_1}( −γ^{-1}B^T U + Y + γ^{-1}(A − YA)A^T ).    (17)

Conversely, if there exist µ > 0 and γ > 0 such that Y ∈ R^{n×n} and U ∈ R^{(n−1)×n} satisfy Eqs. (16) and (17), then Y is a solution of problem (13).

Proof. See Appendix A for details.

From Theorem 3.2, finding a solution Y to problem (13) amounts to solving the coupled fixed-point equations (16) and (17). Hence, we design an iterative algorithm to solve them. First, let us rewrite Eq. (17) as

    Y = prox_{λ1γ^{-1}‖·‖_1}( −γ^{-1}B^T(2U − U) + Y + γ^{-1}(A − YA)A^T ).    (18)

With any initial estimates U^(0) and Y^(0), the iterative scheme based on Eqs. (16) and (18) is as follows:

    U^(k+1) = prox_{µ^{-1}(λ2‖·‖_{2,1})∗}( U^(k) + µ^{-1}BY^(k) ),
    Y^(k+1) = prox_{λ1γ^{-1}‖·‖_1}( −γ^{-1}B^T(2U^(k+1) − U^(k)) + Y^(k) + γ^{-1}(A − Y^(k)A)A^T ).    (19)

In order to perform Eq. (19), we need to evaluate prox_{µ^{-1}(λ2‖·‖_{2,1})∗} and prox_{λ1γ^{-1}‖·‖_1} at each iteration. The Moreau identity is applied to evaluate prox_{µ^{-1}(λ2‖·‖_{2,1})∗}; that is, for any V ∈ R^{(n−1)×n},

    prox_{µ^{-1}(λ2‖·‖_{2,1})∗}(V) = V − µ^{-1} prox_{λ2µ‖·‖_{2,1}}(µV),

where prox_{λ2µ‖·‖_{2,1}} is the ℓ2,1-norm minimization operator [11]. The evaluation of prox_{λ1γ^{-1}‖·‖_1} amounts to computing the soft-thresholding operator [42]. Both the ℓ2,1-norm minimization and soft-thresholding operators are defined in Appendix B. The complete optimization of problem (13) is summarized in Algorithm 1.

Algorithm 1. Proximity algorithm for solving problem (12)
Input: data matrix X ∈ R^{d×n}, parameters λ1, λ2, µ, γ > 0;
Output: Z;
1: Initialize: set k = 0, U^(0) ∈ R^{(n−1)×n}, Y^(0) ∈ R^{n×n}; compute R and W; smooth diag(W);
2: while not converged do
3:   U^(k+1) ← prox_{µ^{-1}(λ2‖·‖_{2,1})∗}( U^(k) + µ^{-1}BY^(k) );
4:   Ỹ^(k+1) ← −B^T(2U^(k+1) − U^(k)) + (A − Y^(k)A)A^T;
5:   Y^(k+1) ← prox_{(λ1/γ)‖·‖_1}( Y^(k) + γ^{-1}Ỹ^(k+1) );
6:   k ← k + 1;
7: end while
8: Let (U^(∞), Y^(∞)) be the last iterate computed above. Set Z := (Y^(∞))^T.
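For concreteness, the following self-contained Python/NumPy sketch of Algorithm 1 (our reconstruction, not the MATLAB code used in Section 4) performs the iteration (19) using entrywise soft-thresholding and a group (ℓ2,1-type) soft-thresholding over the rows of BY. It omits the BF smoothing of the weights, and the parameter values and toy data are illustrative.

```python
import numpy as np

def soft_threshold(X, tau):
    # Entrywise soft-thresholding: prox of tau * ||.||_1 (Appendix B).
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def group_soft_threshold_rows(Q, tau):
    # Row-wise group soft-thresholding: prox of tau * sum_i ||q^i||_2,
    # acting on the rows of Q (the rows of BY correspond to the columns of ZRW).
    norms = np.linalg.norm(Q, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return Q * scale

def ssc_l1tg(X, lam1=0.5, lam2=0.5, mu=1.0, n_iter=200, tol=1e-4):
    """Illustrative sketch of Algorithm 1 / iteration (19); variable names follow the paper."""
    d, n = X.shape
    # Temporal difference matrix R (Eq. (6)) and adaptive weights (Eq. (11), no BF here).
    R = np.zeros((n, n - 1)); idx = np.arange(n - 1)
    R[idx, idx], R[idx + 1, idx] = -1.0, 1.0
    dist2 = np.sum(np.diff(X, axis=1) ** 2, axis=0)
    w = np.exp(-dist2 / dist2.mean())
    A = X.T                                # A := X^T  (n x d)
    B = (R @ np.diag(w)).T                 # B := W^T R^T  ((n-1) x n)
    # gamma chosen as in Eq. (26) so that the convergence condition holds.
    sA = np.linalg.norm(A, 2); sB = np.linalg.norm(B, 2)
    gamma = 1.2 * (sA ** 2 + sB ** 2 / mu)
    U = np.zeros((n - 1, n)); Y = np.zeros((n, n))
    for _ in range(n_iter):
        V = U + (B @ Y) / mu
        # Moreau identity: prox of mu^{-1}(lam2 ||.||_{2,1})^*.
        U_new = V - group_soft_threshold_rows(mu * V, lam2 * mu) / mu
        G = -B.T @ (2 * U_new - U) + (A - Y @ A) @ A.T
        Y_new = soft_threshold(Y + G / gamma, lam1 / gamma)
        rel = np.linalg.norm(Y_new - Y) ** 2 / max(np.linalg.norm(Y_new) ** 2, 1e-12)
        U, Y = U_new, Y_new
        if rel < tol:                      # stopping criterion of Section 4.1
            break
    return Y.T                             # Z = Y^T

# Toy usage: four sequential blocks of strongly correlated columns.
rng = np.random.default_rng(0)
X = np.hstack([rng.standard_normal((20, 1)) @ np.ones((1, 10))
               + 0.05 * rng.standard_normal((20, 10)) for _ in range(4)])
Z = ssc_l1tg(X)
print("Z shape:", Z.shape)
```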

Similarly, if we rewrite Eq. (16) as

    U = prox_{µ^{-1}(λ2‖·‖_{2,1})∗}( U + µ^{-1}B(2Y − Y) ),

we can obtain another iterative scheme as below:

    Y^(k+1) = prox_{λ1γ^{-1}‖·‖_1}( −γ^{-1}B^T U^(k) + Y^(k) + γ^{-1}(A − Y^(k)A)A^T ),
    U^(k+1) = prox_{µ^{-1}(λ2‖·‖_{2,1})∗}( U^(k) + µ^{-1}B(2Y^(k+1) − Y^(k)) ).    (20)

After finding the representation matrix Z by Algorithm 1, an affinity graph can be defined as (|Z| + |Z^T|)/2. Finally, Ncut [15] is used to obtain the clustering results. Algorithm 2 summarizes the whole subspace clustering procedure.

3.4. Convergence and complexity analysis

In this section, we first focus on the convergence analysis of the iterative scheme (19). To do this, the following lemma is necessary.


Algorithm 2. SSC-L1TG
Input: sequential data X ∈ R^{d×n}, drawn from a union of c linear subspaces.
1: Obtain the optimal coefficient representation Z of X by Algorithm 1;
2: Create the affinity matrix A from Z, i.e., A = (|Z| + |Z^T|)/2; an undirected graph is then obtained;
3: Apply Ncut to segment the vertices of the undirected graph into c clusters;
Output: the clustering results of X.
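A minimal Python sketch of Steps 2–3 of Algorithm 2 is shown below (our illustration); scikit-learn's SpectralClustering with a precomputed affinity is used here only as a stand-in for Ncut [15].

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_representation(Z, n_clusters):
    """Steps 2-3 of Algorithm 2: build the affinity (|Z| + |Z^T|)/2 and segment it."""
    A = (np.abs(Z) + np.abs(Z.T)) / 2.0
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               assign_labels="discretize", random_state=0)
    return model.fit_predict(A)

# Toy usage with a block-diagonal-like representation matrix.
rng = np.random.default_rng(0)
Z = np.kron(np.eye(3), np.ones((10, 10))) + 0.01 * rng.standard_normal((30, 30))
labels = cluster_from_representation(Z, n_clusters=3)
print(labels)
```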

Lemma 3.3 ([33], Lemma 3.1). Suppose that h : E1 × E2 → (−∞, ∞] is given by h(x1, x2) = h1(x1) + h2(x2). Then the proximity operator of h with respect to a given matrix D = diag(D1, D2) ∈ S_+ is

    prox_{h,D}(x1, x2) = ( prox_{h1,D1}(x1), prox_{h2,D2}(x2) ).

Denote M := (U, Y) ∈ R^{(n−1)×n} × R^{n×n} and then define

    Θ(M) := (λ2‖U‖_{2,1})∗ + λ1‖Y‖_1,  and  N := diag(µI_{n−1}, γI_n).

Thus, by Lemma 3.3, we have

    prox_{Θ,N}(M) = ( prox_{µ^{-1}(λ2‖·‖_{2,1})∗}(U), prox_{γ^{-1}λ1‖·‖_1}(Y) ).

Furthermore, if we define

    H := [ I_{n−1}      µ^{-1}B
           −γ^{-1}B^T   I_n     ],
    H0 := [ 0             0
            −2γ^{-1}B^T   0 ],
    H1 := H − H0,   c(M) := ( 0, γ^{-1}(A − YA)A^T ),

we can then rewrite Eqs. (16)–(17) and (19), respectively, in the more compact forms

    M = prox_{Θ,N}( HM + c(M) ),    (21)
    M^(k+1) = prox_{Θ,N}( H0 M^(k+1) + H1 M^(k) + c(M^(k)) ).    (22)

Lemma 3.4. Assume M∗ is a fixed point of Eq. (21). If NH1 ∈ S_+^{2n−1}, then

    ⟨M^(k+1) − M∗, M^(k+1) − M^(k)⟩_{NH1} ≤ (1/2)‖M^(k+1) − M^(k)‖_L^2,    (23)

where L = diag(0, (σ1(A))^2 I_n) ∈ R^{(2n−1)×(2n−1)} and σ1(A) is the largest singular value of A.

Proof. See Appendix C for details.

We can now show that, under certain conditions, with any given initial guess M^(0) the Picard iteration {M^(k)} converges to a fixed point satisfying Eq. (21).

Theorem 3.5. If NH1 − L ∈ S_+^{2n−1}, then with any given initial guess M^(0), the Picard iteration {M^(k)} converges.

Proof. See Appendix C for details.

Theorem 3.6. If NH1 − L ∈ S_+^{2n−1}, then with any given initial guess M^(0), the Picard iteration (22) provides a sequence converging to a fixed point satisfying (21), that is, converging to a solution of problem (13).

Proof. According to Lemma 3.1(i), the operator ∂Θ has a closed graph. Thus, letting k in Eq. (30) tend to infinity and using H = H0 + H1, the result is obtained.

We now analyze the computational complexity of Algorithm 1. Given a data matrix X ∈ R^{d×n}, Step 1 involves an operation to smooth the weight vector by BF, whose computational complexity is O(1), i.e., constant time [43]. We can also perform it before the iteration, because it depends only on the input data. Step 4 involves matrix multiplications with a complexity of O(n^3 + n^2 d + nd^2), while Step 3 and Step 5 involve computing the element-wise ℓ2,1-norm minimization operator and the soft-thresholding operator, which have a complexity of at most O(n^3). Therefore, the total complexity of Algorithm 1 is O(T(n^3 + n^2 d + nd^2)), where T represents the number of iterations.

4. Experiments

In this section, the performance of our proposed SSC-L1TG is compared with state-of-the-art methods, namely SSC [2], LRR [11], SpatSC [25], TSC

[30] and OSC [26]. To evaluate performance consistently, Ncut [15] is used for the final segmentation of each method in all experiments. First, the synthetic data set in [8] is used to test the methods in an ideal situation. Then, several real-world datasets are used to test the clustering performance under complex conditions. All experiments are carried out using MATLAB on a PC with an Intel(R) Core(TM) i5-4590 (3.30GHz) CPU and 8GB RAM. For the compared methods, we use the codes provided by the authors and fine-tune the parameters to achieve the best performance.

In the experiments, we use the same subspace clustering error (SCE) metric as in [44] to measure the clustering results. SCE is defined as

    SCE = 1 − (1/n) Σ_{i=1}^n δ(p_i, map(q_i)),    (24)

where p_i and q_i represent the output label and the ground-truth label of the i-th point, respectively, δ(x, y) = 1 if x = y and δ(x, y) = 0 otherwise, and map(q_i) is the best mapping function that permutes the clustering labels to match the ground-truth labels. To evaluate all the methods holistically, we also provide the minimum, maximum, median and mean clustering error for each experiment. In addition, we evaluate the robustness of the methods by adding extra noise to each dataset. In all experiments, we use Gaussian noise with zero mean and unit variance. We modify the noise by changing its magnitude and provide clustering results for each magnitude. We report the level of the noise using the Peak Signal-to-Noise Ratio (PSNR) [26], defined as

    PSNR = 10 log10( s^2 / ( (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n (I_ij − K_ij)^2 ) ),    (25)

where I is the noise-free data, K = I + N is the noisy data and s is the maximum possible value of an element of I. Note that decreasing values of PSNR mean increasing amounts of noise.
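The two metrics can be computed as in the following small Python sketch (our illustration; the paper does not prescribe an implementation of map(·)); here the best label permutation is obtained with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sce(pred, truth):
    """Subspace clustering error of Eq. (24); map(.) is computed on the contingency table."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    classes_p, classes_t = np.unique(pred), np.unique(truth)
    C = np.zeros((len(classes_p), len(classes_t)), dtype=int)
    for i, cp in enumerate(classes_p):
        for j, ct in enumerate(classes_t):
            C[i, j] = np.sum((pred == cp) & (truth == ct))
    row, col = linear_sum_assignment(-C)          # maximize the number of matched points
    return 1.0 - C[row, col].sum() / len(pred)

def psnr(I, K):
    """Peak signal-to-noise ratio of Eq. (25)."""
    mse = np.mean((I - K) ** 2)
    return 10.0 * np.log10(I.max() ** 2 / mse)

truth = np.repeat([0, 1, 2], 10)
pred = np.repeat([2, 0, 1], 10)                   # same partition, permuted labels
print("SCE =", sce(pred, truth))                  # -> 0.0
```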

4.1. Parameter selection

Figure 2: The sensitivity of SSC-L1TG to the model parameters λ1 and λ2 on different databases: (a) Video dataset, (b) Weizmann dataset, (c) Yale B dataset.

Before conducting the experiments, we discuss the selection of the parameters involved in Algorithm 1. First, to guarantee the convergence of Algorithm 1,

we need

    NH1 − L = [ µI_{n−1}   B
                B^T        (γ − σ1(A)^2)I_n ] ∈ S_+^{2n−1}.

In light of [33] Lemma 6.2, this is equivalent to σ1(B)^2 < (γ − σ1(A)^2)µ. Therefore, we empirically choose µ and γ as

    γ = 1.2 ( σ1(A)^2 + σ1(B)^2 / µ ).    (26)
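The following short NumPy sketch (our illustration) verifies numerically that the choice (26) satisfies the condition σ1(B)^2 < (γ − σ1(A)^2)µ, so that NH1 − L is positive definite; the random A and B here are placeholders for X^T and W^T R^T.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 8, 5, 1.0
A = rng.standard_normal((n, d))            # stands in for A = X^T
B = rng.standard_normal((n - 1, n))        # stands in for B = W^T R^T
sA, sB = np.linalg.norm(A, 2), np.linalg.norm(B, 2)
gamma = 1.2 * (sA ** 2 + sB ** 2 / mu)     # Eq. (26)

# Assemble N H1 - L = [[mu*I, B], [B^T, (gamma - sA^2)*I]] and check its eigenvalues.
M = np.block([[mu * np.eye(n - 1), B],
              [B.T, (gamma - sA ** 2) * np.eye(n)]])
print("sigma_1(B)^2 < (gamma - sigma_1(A)^2)*mu :", sB ** 2 < (gamma - sA ** 2) * mu)
print("smallest eigenvalue of N H1 - L:", np.linalg.eigvalsh(M).min())  # > 0
```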

Moreover, for acceleration, we update µ_{k+1} = max(ρµ_k, µ_min) with 0 < ρ < 1 and µ_min > 0 a lower bound, which is inspired by [41].

Next, we test the sensitivity of SSC-L1TG to the model parameters λ1 and λ2. We tune the parameters within the set {10^{-3}, 10^{-2}, 10^{-1}, 0.5, 1.0, 10, 10^2, 10^3} and show the segmentation accuracy (ACC = 1 − SCE) obtained by SSC-L1TG on different datasets. From Fig. 2, we can conclude that the clustering accuracy on these datasets is affected by λ1 and λ2. In particular, we observe that satisfactory clustering results can be obtained by taking (λ1, λ2) = (0.5, 0.5) for the Video dataset, (0.01, 0.5) for the Weizmann dataset and (0.5, 0.5) for the Yale B dataset.

Table 1: Subspace clustering error (%) for the synthetic data set with various magnitudes of Gaussian noise; lower is better.

PSNR  SCE    SSC     LRR     SpatSC  OSC     TSC     SSC-L1TG
Max   Min    0       1.00    0       0       9.00    0
      Max    0       27.00   0       0       29.00   0
      Med    0       14.00   0       0       17.00   0
      Mean   0       12.05   0       0       19.40   0
91    Min    0       7.00    0       0       13.00   0
      Max    0       27.00   0       0       29.00   0
      Med    0       14.00   0       0       25.00   0
      Mean   0       15.67   0       0       22.00   0
71    Min    0       3.00    0       0       13.00   0
      Max    0       30.00   0       0       28.00   0
      Med    0       12.00   0       0       23.00   0
      Mean   0       13.78   0       0       22.44   0
51    Min    0       9.00    0       0       13.00   0
      Max    0       33.00   0       0       23.00   0
      Med    0       18.00   0       0       19.50   0
      Mean   0       19.50   0       0       18.75   0
31    Min    6.00    15.00   0       0       10.00   0
      Max    19.00   35.00   24.00   24.00   29.00   0
      Med    10.00   27.00   0       0       25.00   0
      Mean   15.12   24.57   3.43    3.43    23.14   0

In Algorithm 1, we adopt the following stopping criterion:

    ‖Y^(k+1) − Y^(k)‖_F^2 / ‖Y^(k+1)‖_F^2 < ε,

with ε = 10^{-4}.

4.2. Synthetic experiment

In this part, evaluation is performed using randomly generated subspace-structured data. Similar to [8, 27], 5 subspaces {S_i}_{i=1}^5 are constructed whose bases {U_i}_{i=1}^5 are computed by U_{i+1} = TU_i, 1 ≤ i ≤ 4, where T is a random rotation matrix and U_1 is a random orthonormal basis of dimension 100 × 5. In other words, each basis is a random rotation away from the previous basis and the dimension of each subspace is 5. Then 20 data points are sampled from each subspace by X_i = U_i Q_i, where Q_i ∈ R^{5×20} is a random Gaussian matrix. This mimics the assumption that consecutive data points

Figure 3: The affinity matrices of the synthesized data obtained without adding noise: (a) SSC, (b) LRR, (c) SpatSC, (d) OSC, (e) TSC, (f) SSC-L1TG.

Figure 4: Clustering results from the affinity matrices in Figure 3: (a) SSC, (b) LRR, (c) SpatSC, (d) OSC, (e) TSC, (f) SSC-L1TG. SSC-L1TG, SSC, OSC and SpatSC achieve perfect segmentation, while LRR and TSC suffer from misclassification.

within the same subspace are similar to each other. Finally, the data are concatenated as X = [X_1, · · ·, X_5]. We repeated the experiment 50 times with new random bases and coefficient matrices each time. Furthermore, we repeated the experiment with various levels of noise to determine robustness. The clustering results are reported in Table 1. It is worth noting that our method always obtains the


Table 2: Subspace clustering error (%) for the video segmentation dataset [45] with various magnitudes of Gaussian noise; lower is better. Experiments show that SSC-L1TG achieves better segmentation.

PSNR  SCE    SSC     LRR     SpatSC  OSC     TSC     SSC-L1TG
Max   Min    51.82   51.82   54.46   54.46   33.00   16.17
      Max    60.07   53.80   60.40   59.41   58.75   20.79
      Med    55.11   53.47   60.07   56.77   40.59   16.83
      Mean   55.16   53.14   58.64   57.04   42.52   17.76
81    Min    54.46   53.14   56.11   54.13   22.77   15.84
      Max    55.78   53.80   60.40   58.75   46.53   19.80
      Med    54.46   53.14   56.77   56.77   34.65   17.16
      Mean   54.90   54.36   57.36   56.30   35.91   17.49
41    Min    49.83   53.14   53.80   55.12   29.04   14.85
      Max    55.07   57.43   60.07   59.08   44.22   18.15
      Med    51.82   54.46   55.45   57.10   35.64   15.18
      Mean   53.00   54.72   56.63   57.43   36.24   16.13
27    Min    49.50   40.26   54.13   54.79   34.65   12.54
      Max    54.79   41.58   57.10   60.07   50.50   15.18
      Med    52.15   40.92   55.12   55.78   41.25   13.86
      Mean   52.15   40.92   55.31   57.03   41.45   13.99

true segmentation regardless of the PSNR, and it is more accurate than the other methods, although SSC, SpatSC and OSC also perform well when the PSNR is not less than 51. We also provide visual comparisons of the affinity matrices Z in Fig. 3 and the corresponding clustering results in Fig. 4. Visually, we observe that our method provides an affinity matrix which is block diagonal and contains denser values within the diagonal blocks. Fig. 4 further verifies this observation.

4.3. Video segmentation

The aim of this experiment is to segment individual scenes from a video sequence. We test all the methods on the video dataset [45] (http://www.kecl.ntt.co.jp/people/kimura.akisato/saliency3.html), which contains 10 videos. We randomly selected three sequences for segmentation. The preprocessing of each sequence consists of converting the colour video to grayscale and sampling to a resolution of 288 × 352. Each frame in the sequence is

Figure 5: The affinity matrices of the video data obtained without adding noise: (a) SSC, (b) LRR, (c) SpatSC, (d) OSC, (e) TSC, (f) SSC-L1TG.

Figure 6: Clustering results from the affinity matrices in Figure 5: (a) SSC, (b) LRR, (c) SpatSC, (d) OSC, (e) TSC, (f) SSC-L1TG. SSC-L1TG achieves better segmentation, while SSC, LRR, SpatSC, TSC and OSC suffer from misclassification.

then vectorised to x_i ∈ R^{101376} and concatenated with consecutive frames to form X ∈ R^{101376×303}. Furthermore, each sequence was corrupted with various magnitudes of Gaussian noise, and we repeated the experiment 10 times for each magnitude. The clustering errors are reported in Table 2. Again, our method outperforms the other methods and can find the true segmentation. To show this clearly, we give

Table 3: Subspace clustering error (%) for the 3 action sequences with various magnitudes of Gaussian noise; lower is better. Experiments show that SSC-L1TG achieves better segmentation.

PSNR  SCE    SSC     LRR     SpatSC  OSC     TSC     SSC-L1TG
Max   Min    20.65   27.72   17.93   23.91   21.74   7.61
      Max    20.65   27.72   17.93   23.91   94.13   7.61
      Med    20.65   27.72   17.93   23.91   45.11   7.61
      Mean   20.65   27.72   17.93   23.91   43.48   7.61
40    Min    20.65   27.72   17.93   23.91   22.83   7.61
      Max    21.20   27.72   19.02   23.91   62.50   7.61
      Med    21.20   27.72   18.21   23.91   44.84   7.61
      Mean   21.06   27.72   18.32   23.91   44.29   7.61
20    Min    22.28   39.67   13.59   19.57   22.83   7.61
      Max    22.72   39.67   19.57   36.96   61.41   11.96
      Med    21.88   39.67   16.85   24.18   44.02   9.24
      Mean   22.36   39.67   16.30   24.80   41.44   9.31
5     Min    20.21   61.94   44.57   44.02   34.78   2.72
      Max    20.44   62.50   50.00   49.46   46.74   16.30
      Med    20.11   62.50   48.60   48.37   42.93   8.15
      Mean   20.22   61.96   47.96   47.69   42.46   8.76

visual comparisons of the affinity matrices Z and the corresponding clustering results in Fig. 5 and Fig. 6.

4.4. Action clustering

This experiment evaluates the action clustering performance of our approach and the compared methods on the Weizmann dataset (http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html). The action data in the Weizmann dataset [46] are organized in isolated clips, which provides an ideal controlled evaluation platform for temporal clustering. This dataset contains 90 video sequences (180 × 144 pixels, 50 fps) captured from 9 subjects. Each subject performs 10 different actions, including jumping-jack (or shortly jack), jump-forward-on-two-legs (jump), jump-in-place-on-two-legs (pjump), gallop-sideways (side), bend, skip, walk, run, wave-one-hand (wave1), and wave-two-hands (wave2). Because no frame-by-frame ground truth is provided for this dataset, we generate our ground-truth segmentation by concatenating a set of motion clips, each of which shares a single semantic.

Figure 7: The affinity matrices of the three-class action data obtained without adding noise: (a) SSC, (b) LRR, (c) SpatSC, (d) OSC, (e) TSC, (f) SSC-L1TG.

Figure 8: Clustering results from the affinity matrices in Figure 7: (a) SSC, (b) LRR, (c) SpatSC, (d) OSC, (e) TSC, (f) SSC-L1TG. SSC-L1TG achieves better segmentation.

Specifically, we test all the methods in two settings, which consist of 3 action clips (e.g. jack, run, walk) and 6 action clips (e.g. bend, jack, jump, run, walk, pjump), respectively. Similar to the previous experiments, we add increasing Gaussian noise and repeat the experiments 10 times for each magnitude of noise. The statistical results of segmenting the 3 action sequences

Table 4: Subspace clustering error (%) for the 6 action sequences with various magnitudes of Gaussian noise; lower is better. Experiments show that SSC-L1TG achieves better segmentation.

PSNR  SCE    SSC     LRR     SpatSC  OSC     TSC     SSC-L1TG
Max   Min    20.79   41.84   45.26   36.84   28.95   17.37
      Max    21.05   42.36   44.21   36.84   75.26   19.21
      Med    20.79   42.11   49.61   36.84   62.63   18.42
      Mean   20.86   42.05   49.67   36.84   57.37   18.36
122   Min    20.79   41.84   45.26   36.84   28.95   18.68
      Max    21.32   44.95   54.21   36.84   33.68   19.21
      Med    20.79   44.16   45.26   36.84   30.79   18.68
      Mean   20.96   42.89   48.25   36.84   31.14   18.86
82    Min    20.79   42.95   45.26   36.84   24.74   17.37
      Max    26.58   49.16   54.21   36.84   48.68   19.21
      Med    20.79   46.05   54.21   36.84   28.42   18.68
      Mean   22.05   43.84   50.63   36.84   31.96   18.53
42    Min    20.26   43.42   53.95   36.05   28.68   17.11
      Max    21.05   47.37   54.21   36.05   46.32   19.47
      Med    20.79   47.11   54.21   36.05   30.79   18.42
      Mean   20.68   45.84   54.11   36.05   33.89   18.26

are reported in Table 3, and the corresponding visual results are shown in Fig. 7 and Fig. 8, while the statistical results of segmenting the 6 action sequences are reported in Table 4, with the corresponding visual results shown in Fig. 9 and Fig. 10. Specifically, Fig. 8 and Fig. 10 show that SSC and especially LRR cannot obtain meaningful temporal segments, as they do not take the temporal information into account. SpatSC, OSC, TSC and our SSC-L1TG method can obtain continuous segments in most cases. Moreover, because of the use of the distance-dependent weights, our approach generates affinity matrices with clearer diagonal structures in Fig. 7 and Fig. 9, and therefore achieves a lower SCE than the other methods in Table 3 and Table 4.

4.5. Face clustering

The aim of this experiment is to cluster unique subjects from a set of face images. As remarked in [26], we ensure that in our tests the faces are kept contiguous, i.e., unique subjects do not mix, so that this task is more suited to the assumption of clustering sequential data. It is reasonable to exploit spatial information since the neighbours of each data vector have a high

Figure 9: The affinity matrices of the six-class action data obtained without adding noise: (a) SSC, (b) LRR, (c) SpatSC, (d) OSC, (e) TSC, (f) SSC-L1TG.

Figure 10: Clustering results from the affinity matrices in Figure 9: (a) SSC, (b) LRR, (c) SpatSC, (d) OSC, (e) TSC, (f) SSC-L1TG. SSC-L1TG achieves better segmentation.

Figure 11: Face clustering: given an ordered set of face images, the goal is to cluster images that belong to the same subject.


Table 5: Subspace clustering error (%) for the face data with various magnitudes of Gaussian noise; lower is better. Experiments show that SSC-L1TG achieves better segmentation.

PSNR  SCE    SSC     LRR     SpatSC  OSC     TSC     SSC-L1TG
Max   Min    60.00   57.67   61.67   55.00   5.00    0
      Max    61.67   57.67   61.67   65.00   53.33   0
      Med    61.67   57.67   61.67   65.00   43.33   0
      Mean   61.39   57.67   61.67   63.33   37.50   0
85    Min    60.00   60.00   61.67   55.00   40.00   0
      Max    61.67   61.33   61.67   65.00   55.00   0
      Med    61.67   60.00   61.67   65.00   47.50   0
      Mean   61.22   60.53   61.67   63.33   47.78   0
45    Min    60.00   60.33   61.67   55.00   35.00   0
      Max    63.33   60.33   61.67   65.00   53.33   0
      Med    61.67   60.33   61.67   65.00   44.17   0
      Mean   61.97   51.11   61.67   61.67   43.89   0
30    Min    60.00   58.33   56.67   63.33   40.00   1.67
      Max    63.33   60.00   63.33   63.33   55.00   3.33
      Med    60.00   60.00   61.67   63.33   45.00   1.67
      Mean   60.28   59.56   60.56   63.33   46.11   2.22

probability of belonging to the same subject. We use the Extended Yale B dataset [47] to carry out this experimental setting. This dataset contains 16128 images of 28 human subjects under 9 poses and 64 illumination conditions. Some example images are shown in Fig. 11. For each test, we randomly pick 3 subjects from the dataset, and then 20 sample images are randomly selected for each subject. To highlight the efficiency of our method, the images directly form data vectors x_i ∈ R^{307200} without any dimensionality reduction. Then we concatenate them together in order to ensure that subjects are not mixed. We repeated these tests 30 times for each level of corruption by Gaussian noise. Statistical results of the subspace clustering error can be found in Table 5. We observe that our method SSC-L1TG outperforms the other methods for all PSNR levels. This is also verified visually in Fig. 12 and Fig. 13, where the true segmentation is achieved by our method.

4.6. Convergence study and computational time

To show the convergence performance of SSC-L1TG in Algorithm 1, we plot the evolving curves of our objective function in Eq. (13) with respect to the number of iterations in Fig. 14. From this figure, we can

Figure 12: The affinity matrices of the face data obtained without adding noise: (a) SSC, (b) LRR, (c) SpatSC, (d) OSC, (e) TSC, (f) SSC-L1TG.

Figure 13: Clustering results from the affinity matrices in Figure 12: (a) SSC, (b) LRR, (c) SpatSC, (d) OSC, (e) TSC, (f) SSC-L1TG. SSC-L1TG achieves perfect segmentation.

see that the proposed optimization Algorithm 1 is effective and converges quickly, usually within 50 iterations. Moreover, we compare the computational times (in seconds) of the different methods. Since all the methods apply Ncut to segment the data after the affinity matrices are obtained, for a fair comparison, we only compare the computational time taken to calculate the representation matrix Z. From the bar charts

Figure 14: Convergence behavior of Algorithm 1 on different data sets: (a) Video dataset, (b) six-class Weizmann dataset, (c) three-class Weizmann dataset, (d) Yale B dataset.

in Fig. 15, we can see that our method runs the fastest on all data sets. This is reasonable because our method is free of SVD and of solving an equation system. SSC, LRR and TSC run very slowly because SSC converges slowly, LRR includes an SVD, and TSC requires learning a dictionary in each iteration. The complexity of TSC is O(nr^2) per iteration, which is likely less than that of SSC-L1TG, but many more iterations are required by TSC, so it still performs more slowly than SSC-L1TG. Overall, our method is more efficient than the others.

4.7. Discussion

We first discuss the effect of the proposed weights on the clustering results. Note that although the Gaussian function exp(−x^2/t) in Eq. (11) can reduce the impact of noise and outliers with large values to some extent, it may not be enough, so smoothing is further performed using the bilateral filter (BF).

Figure 15: Comparison of computational time (in seconds) on (a) the Video dataset, (b) the six-class Weizmann actions, (c) the three-class Weizmann actions and (d) the Yale B dataset. We can intuitively find that the computational time taken by SSC-L1TG is the least.

Fig. 16 shows the clustering results corresponding to the weights in Fig. 1. We can see that the clustering results in (a) and (b), obtained by using w_i and the smoothed w_i respectively, are better than the result in (c), which is produced by using the constant weight w_i ≡ 1. Meanwhile, the clustering result in (b) is better than that in (a), because in (a), compared with (b), samples belonging to a different class (green) are mixed into another class (blue). In a word, the clustering results of the proposed method are better than those of OSC, and they can be improved by smoothing the weights.

Next, we test the sensitivity of the SSC-L1TG method to changes in the class number and the PSNR. We generate action sequences by randomly selecting action clips from three, five and six classes of the Weizmann dataset. The results are shown in Fig. 17, from which it can easily be seen that the ACC decreases as the class number rises, while it remains stable under different PSNRs.


Figure 16: Clustering results corresponding to the weights shown in Fig. 1: (a) w_i in Eq. (11), (b) smoothed w_i by BF, (c) constant weight w_i ≡ 1 used by OSC.

Figure 17: Sensitivity of the clustering results to the class number and PSNR. Note that decreasing values of PSNR mean increasing amounts of noise.

5. Conclusions

This paper explores the problem of sequential subspace clustering. We present a novel clustering method (SSC-L1TG) that is based on an ℓ1-norm temporal graph. A proximity algorithm is developed with a theoretical convergence analysis. The proposed SSC-L1TG is tested on different types of data, such as synthetic data, video data, human activity data and face data. The experimental results demonstrate that our method outperforms state-of-the-art methods including SSC, LRR, SpatSC, OSC and TSC. As the weights in the temporal graph are simply set from the Euclidean distance between neighboring frames, which is generally sensitive to data noise and outliers, we will explore other advanced methods such as multilevel hypergraphs to encode temporal similarity in future work. To improve efficiency, integrating certain dimensionality reduction techniques, such as feature selection and principal compo-

nent analysis (PCA), into subspace clustering will be another direction.

Acknowledgement

This work was partially supported by the National Natural Science Foundation of China (Nos. 61863001, 61502107, 11661007, 61702244, 61962003 and 11761010), the Natural Science Foundation of Jiangxi Province (Nos. 20181BAB202021, 20192BAB205086, JXJG-18-14-11), the Natural Science Foundation of Zhejiang Province (No. LD19A010002), and the research projects of Gannan Normal University (Nos. 18zb04, YCX18B001).

Appendix A

Proof of Theorem 3.2. First, we assume Y is a solution of problem (13). By the Fermat rule and the chain rule of the subdifferential, we get

    0 ∈ −(A − YA)A^T + ∂_{λ1‖·‖_1}(Y) + B^T ∂_{λ2‖·‖_{2,1}}(BY).    (27)

Then for any µ > 0 and γ > 0, there exists U ∈ ∂_{λ2‖·‖_{2,1}}(BY) ⇔ µ^{-1}BY ∈ ∂_{µ^{-1}(λ2‖·‖_{2,1})∗}(U) such that

    −γ^{-1}B^T U + γ^{-1}(A − YA)A^T ∈ ∂_{λ1γ^{-1}‖·‖_1}(Y).

Then, by Lemma 3.1(iii) with D = I, we obtain Eqs. (16) and (17). Conversely, if Eqs. (16) and (17) hold, we then have U ∈ ∂_{λ2‖·‖_{2,1}}(BY) and

    −B^T U + (A − YA)A^T ∈ ∂_{λ1‖·‖_1}(Y)


Appendix B

1. The soft-thresholding operator [42] is defined as

    D_ε[x] = { x − ε,  if x > ε;
               x + ε,  if x < −ε;
               0,      otherwise, }    (28)

where x ∈ R and ε > 0 is a small positive number. This operator can be extended to vectors and matrices by applying it element-wise. Therefore, for any matrix X, we have prox_{ε‖·‖_1}(X) = D_ε[X] = (D_ε[X_ij]).

2. The ℓ2,1-norm minimization operator [8]. Given a matrix Q = [q_1, q_2, · · ·, q_i, · · ·], if the optimal solution of

    prox_{λ‖·‖_{2,1}}(Q) = argmin_W λ‖W‖_{2,1} + (1/2)‖W − Q‖_F^2

is W∗, then the i-th column of W∗ is

    W∗(:, i) = { ((‖q_i‖ − λ)/‖q_i‖) q_i,  if λ < ‖q_i‖;
                 0,                        otherwise. }
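Both operators admit the following direct NumPy implementation (a sketch of our own, provided for reference); the small constant guarding against division by zero is an implementation detail.

```python
import numpy as np

def soft_threshold(X, eps):
    # Appendix B, Eq. (28): element-wise prox of eps * ||.||_1.
    return np.sign(X) * np.maximum(np.abs(X) - eps, 0.0)

def l21_prox(Q, lam):
    # Appendix B, item 2: column-wise l2,1-norm minimization operator.
    norms = np.linalg.norm(Q, axis=0)
    scale = np.where(norms > lam, (norms - lam) / np.maximum(norms, 1e-12), 0.0)
    return Q * scale

# Quick numerical check against the defining minimization problem.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 5))
W = l21_prox(Q, 0.3)
obj = lambda V: 0.3 * np.linalg.norm(V, axis=0).sum() + 0.5 * np.linalg.norm(V - Q) ** 2
print(obj(W) <= obj(Q), obj(W) <= obj(0.5 * Q))   # W attains a lower objective value
```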

Appendix C

Proof of Lemma 3.4. By Lemma 3.1 (iii), we have

    N(HM∗ − M∗ + c(M∗)) ∈ ∂Θ(M∗),    (29)
    N(H0 M^(k+1) − M^(k+1) + H1 M^(k) + c(M^(k))) ∈ ∂Θ(M^(k+1)).    (30)

By Lemma 3.1 (vi) again, we get

    0 ≤ ⟨M^(k+1) − M∗, N(H0 M^(k+1) − M^(k+1) + H1 M^(k) + c(M^(k))) − N(HM∗ − M∗ + c(M∗))⟩
      = ⟨M^(k+1) − M∗, N(H − I_{2n−1})(M^(k+1) − M∗)⟩ − ⟨M^(k+1) − M∗, NH1(M^(k+1) − M^(k))⟩
        + ⟨M^(k+1) − M∗, N(c(M^(k)) − c(M∗))⟩,

where the first term equals 0, while the third term can be simplified by using the law of cosines:

    ⟨M^(k+1) − M∗, N(c(M^(k)) − c(M∗))⟩ = ⟨(Y^(k+1) − Y∗)A, −(Y^(k) − Y∗)A⟩
      = (1/2)‖(Y^(k+1) − Y^(k))A‖_F^2 − (1/2)‖(Y^(k+1) − Y∗)A‖_F^2 − (1/2)‖(Y^(k) − Y∗)A‖_F^2
      ≤ (1/2)‖(Y^(k+1) − Y^(k))A‖_F^2
      ≤ (σ1(A)^2 / 2)‖Y^(k+1) − Y^(k)‖_F^2.

Then,

    ⟨M^(k+1) − M∗, M^(k+1) − M^(k)⟩_{NH1} ≤ (1/2)‖M^(k+1) − M^(k)‖_L^2.

Proof of Theorem 3.5. Let M∗ ∈ R^{(n−1)×n} × R^{n×n} be a fixed point satisfying Eq. (21). Note that NH1 − L ∈ S_+^{2n−1} implies NH1 ∈ S_+^{2n−1}. For any k, we then have, by the law of cosines and Lemma 3.4,

    ‖M^(k) − M∗‖_{NH1}^2 = ‖M^(k+1) − M∗‖_{NH1}^2 + ‖M^(k) − M^(k+1)‖_{NH1}^2
                           + 2⟨M^(k+1) − M∗, M^(k) − M^(k+1)⟩_{NH1}
                         ≥ ‖M^(k+1) − M∗‖_{NH1}^2 + ‖M^(k) − M^(k+1)‖_{NH1}^2
                           − ⟨M^(k+1) − M^(k), L(M^(k+1) − M^(k))⟩
                         = ‖M^(k+1) − M∗‖_{NH1}^2 + ‖M^(k) − M^(k+1)‖_{NH1−L}^2,

from which we can deduce that, for any M^(0), the Picard sequence {M^(k)} is bounded and lim_{k→∞} ‖M^(k) − M^(k+1)‖^2 = 0. These facts imply that {M^(k)} has at most one cluster point. That is, {M^(k)} is bounded and has exactly one cluster point, and therefore it converges.

References

35

[2] E. Elhamifar, R. Vidal, Sparse subspace clustering, in: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2790–2797. [3] R. Liu, Z. Lin, Z. Su, J. Cao, Solving Principal Component Pursuit in Linear Time via `1 Filtering, Neurcomputing 142 (2014) 529–541. [4] P.S. Bradley, O.L. Mangasarian, k-plane clustering, J. Global Optim. 16 (1) (2000) 23–32. [5] R. Vidal, Y. Ma, S. Sastry, Generalized principal component analysis, IEEE Trans. Pattern Anal. Mach. Intell. 27 (12) (2005) 1945–1959. [6] J.P. Costeira, T. Kanade, A multibody factorization method for independently moving objects, Int. J. Comput. Vis. 29 (3) (1998) 159–179. [7] Y. Ma, H. Derksen, W. Hong, J. Wright, Segmentation of multivariate mixed data via lossy data coding and compression, IEEE Trans. Pattern Anal. Mach. Intell., 29 (9) (2007) 1546–1562. [8] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: Proceedings of the 2010 International Conference on Machine Learning, vol. 3, 2010, pp. 663–670. [9] C. Lu, H. Min, Z. Zhao, L. Zhu, D. Huang, S. Yan, Robust and efficient subspace segmentation via least square regression, in: Proceedings of the European Conference on Computer Vision, 2012, pp. 347–360. [10] R. Vidal, Subspace clustering, IEEE Signal Process. Mag., 28 (2) (2011) 52–68. [11] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Trans. Pattern Anal. Mach. Intell., 34 (1) (2013) 171–184. [12] C. Lu, J. Feng, Z. Lin, S. Yan, Correlation adaptive subspace segmentation by trace lasso, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1345–1352. [13] X. Zhang, C. Xu, X. Sun, G. Baciu, Schatten-q regularizer constrained low rank subspace clustering model, Neurocomputing 182 (19) (2016) 36–47. 36

[14] G. Liu, Z. Zhang, Q. Liu, H. Xiong, Robust subspace clustering with compressed data, IEEE Trans. Image Process. 28 (10) (2019) 5161–5170.

[15] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 888–905.

[16] J. Wright, A.Y. Yang, A. Ganesh, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.

[17] G. Lerman, T. Zhang, Robust recovery of multiple subspaces by geometric `p minimization, Ann. Stat. 39 (5) (2011) 2686–2715.

[18] K. Tang, R. Liu, Z. Su, J. Zhang, Structure-constrained low-rank representation, IEEE Trans. Neural Netw. Learn. Syst. 24 (3) (2014) 2167–2079.

[19] W. Jiang, J. Liu, H. Qi, Q. Dai, Robust subspace segmentation via nonconvex low rank representation, Inf. Sci. 340–341 (2016) 144–158.

[20] X. Peng, C. Lu, Y. Zhang, H. Tang, Connections between nuclear-norm and Frobenius-norm-based representations, IEEE Trans. Neural Netw. Learn. Syst. 29 (1) (2018) 218–224.

[21] X. Peng, J. Feng, S. Xiao, W. Yau, J. Zhou, S. Yang, Structured autoencoders for subspace clustering, IEEE Trans. Image Process. 27 (10) (2018) 5076–5086.

[22] X. Peng, Z. Yu, Y. Zhang, H. Tang, Constructing the L2-graph for robust subspace learning and subspace clustering, IEEE Trans. Cybern. 47 (4) (2017) 1053–1066.

[23] G. Liu, Q. Liu, P. Li, Blessing of dimensionality: Recovering mixture data via dictionary pursuit, IEEE Trans. Pattern Anal. Mach. Intell. 39 (1) (2017) 47–60.

[24] F. Wu, Y. Hu, J. Gao, Y. Sun, B. Yin, Ordered subspace clustering with block-diagonal priors, IEEE Trans. Cybern. 46 (12) (2016) 3209–3219.

[25] Y. Guo, J. Gao, F. Li, Spatial subspace clustering for hyperspectral data segmentation, in: Proc. Soc. Digit. Inf. Wireless Commun., Dubai, UAE, 2013, pp. 180–190.

[26] S. Tierney, J. Gao, Y. Guo, Subspace clustering for sequential data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1019–1026.

[27] S. Tierney, J. Gao, Y. Guo, Segmentation of subspaces in sequential data, Pers. Ubiquit. Comput. 12 (12) (2015) 197–221.

[28] Z. Lin, R. Liu, Z. Su, Linearized alternating direction method with adaptive penalty for low rank representation, in: Advances in Neural Information Processing Systems, 2011, pp. 612–620.

[29] R. Liu, Z. Lin, Z. Su, Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning, in: Asian Conference on Machine Learning, 2013, pp. 116–132.

[30] S. Li, K. Li, Y. Fu, Temporal subspace clustering for human motion segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4453–4461.

[31] F. Nie, H. Wang, H. Huang, C. Ding, Unsupervised and semi-supervised learning via `1 -norm graph, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2268–2273.

[32] A. Beck, First-order methods in optimization, vol. 25, SIAM, 2017.

[33] Q. Li, L. Shen, Y. Xu, N. Zhang, Multi-step fixed-point proximity algorithms for solving a class of optimization problems arising from image processing, Adv. Comput. Math. 41 (2015) 387–422.

[34] Q. Li, N. Zhang, Fast proximity-gradient algorithms for structured convex optimization problems, Appl. Comput. Harmon. Anal. 41 (2016) 491–517.

[35] X. Li, Q. Lu, Y. Dong, D. Tao, Robust subspace clustering by Cauchy loss function, IEEE Trans. Neural Netw. Learn. Syst. 30 (7) (2019) 2607–2078.

[36] K. Tang, J. Zhang, Z. Su, J. Dong, Bayesian low-rank and sparse nonlinear representation for manifold clustering, Neural Process. Lett. 44 (3) (2016) 719–733.

[37] K. Tang, Z. Su, W. Jiang, J. Zhang, Superpixels for large dataset subspace clustering, Neural Comput. Applic., 2018. https://doi.org/10.1007/s00521-018-3914-2

[38] C. Tomasi, R. Manduchi, Bilateral filtering for gray and color images, in: Proceedings of the IEEE International Conference on Computer Vision, 1998, pp. 839–846.

[39] X. Hu, Y. Sun, J. Gao, Y. Hu, B. Yin, Locality preserving projection based on F-norm, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 1330–1337.

[40] H. Bauschke, P. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, CMS Books in Mathematics, Springer, New York, 2011.

[41] Z. Lin, M. Chen, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, Technical report, UILU-ENG-09-2215, 2009.

[42] J. Cai, E. Candes, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (2010) 1956–1982.

[43] Q. Yang, K. Tan, N. Ahuja, Real-time O(1) bilateral filtering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 557–564.

[44] C. Lu, J. Feng, Z. Lin, T. Mei, S. Yan, Subspace clustering by block diagonal representation, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2) (2019) 487–501.

[45] K. Fukuchi, K. Miyazato, A. Kimura, S. Takagi, J. Yamato, Saliency-based video segmentation with graph cuts and sequentially updated priors, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2009, pp. 638–641.

[46] L. Gorelick, M. Blank, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, IEEE Trans. Pattern Anal. Mach. Intell. 29 (12) (2007) 2247–2253.


[47] A. Georghiades, P. Belhumeur, D. Kriegman, From few to many: Illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (2001) 643–660.


Wenyu Hu received his Ph.D. degree in Computational Mathematics from Dalian University of Technology, China, in 2012. He is currently an associate professor with the School of Mathematics and Computer Science of Gannan Normal University, Ganzhou, China. His research interests include machine learning and computer vision.

Shenghao Li received the B.S. degree in Information and Computational Science from Hunan Institute of Engineering, Xiangtan, China, in 2014. He is currently working toward the Master's degree in the School of Mathematics and Computer Science, Gannan Normal University, Ganzhou, China. His current research interests include machine learning and computer vision.

Weidong Zheng received the B.S. degree in Information and Computing Science from Gannan Normal University, Ganzhou, China, in 2016. He is currently working toward the Master's degree in the School of Mathematics and Computer Science, Gannan Normal University, Ganzhou, China. His research interests include image processing and numerical optimization.

Yao Lu received the B.S. degree in Information and Computing Science from Gannan Normal University, Ganzhou, China, in 2016. He is currently working toward the Master's degree in the School of Mathematics and Computer Science, Gannan Normal University, Ganzhou, China. His research interests include image processing and numerical optimization.


Gaohang Yu received his Ph.D. degree in Computational Mathematics from Sun Yat-Sen University in 2007. He is currently a professor in the School of Science at Hangzhou Dianzi University. His research interests include optimization theory and methods with applications in machine learning, and imaging sciences.


Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
