Auto-weighted Multi-view Co-clustering via Fast Matrix Factorization
Journal Pre-proof
Feiping Nie, Shaojun Shi, Xuelong Li
PII: S0031-3203(20)30013-3
DOI: https://doi.org/10.1016/j.patcog.2020.107207
Reference: PR 107207
To appear in: Pattern Recognition
Received date: 16 January 2019
Revised date: 28 November 2019
Accepted date: 16 January 2020
Please cite this article as: Feiping Nie, Shaojun Shi, Xuelong Li, Auto-weighted Multiview Co-clustering via Fast Matrix Factorization, Pattern Recognition (2020), doi: https://doi.org/10.1016/j.patcog.2020.107207
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier Ltd.
Highlights
• Unlike existing multi-view clustering methods, the proposed approaches impose indicator-matrix constraints in the matrix decomposition. Owing to these constraints, the clustering results of samples and features are obtained directly. Thus, the proposed methods are highly efficient for clustering multi-view data sets.
• According to the importance of each view for the clustering task, the proposed approaches automatically learn the weight factors in a re-weighted manner. Moreover, the proposed methods are parameter-free, which makes them more practical.
• Compared with graph-based multi-view clustering algorithms, the computational complexity of the proposed methods is the same as that of the traditional K-means algorithm, because they do not require eigenvalue decomposition in the solving process, which is a heavy computational burden for multi-view data sets.
Feiping Nie (a,∗), Shaojun Shi (a), Xuelong Li (a)
(a) School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xian 710072, Shaanxi, P. R. China.
Abstract
Multi-view clustering is a hot research topic in machine learning and pattern recognition; however, clustering multi-view data sets still suffers from high computational complexity. Although a number of approaches have been proposed to improve computational efficiency, most of them do not consider the duality between features and samples. In this paper, we propose a novel co-clustering approach termed Fast Multi-view Bilateral K-means (FMVBKM), which clusters the rows and columns of the input data matrix simultaneously. Specifically, FMVBKM applies the relaxed K-means clustering technique to multi-view data clustering. In addition, to decrease information loss in matrix factorization, we further introduce a new co-clustering method named Fast Multi-view Matrix Tri-Factorization (FMVMTF). Extensive experimental results on six benchmark data sets show that the two proposed approaches not only achieve comparable clustering performance but also exhibit high computational efficiency in comparison with state-of-the-art multi-view clustering methods.
Keywords: Co-clustering, multi-view data, matrix factorization, auto-weighted.
1. Introduction
Data clustering is a fundamental technique that is widely used in the computer vision [1], pattern recognition [2] and data mining [3] communities. The goal of clustering is to assign data with similar patterns to the same cluster while placing data with dissimilar structures in different groups. Traditional clustering algorithms, such as K-means and spectral clustering, have been studied extensively in past years. With the rapid development of information technology and the rise of data-sharing web sites, data sets are now often described by heterogeneous features collected from different sources. Clustering such multi-view data is therefore a challenging problem. Although spectral clustering and its variants achieve good performance, they do not scale to large multi-view data because they need to construct an adjacency matrix, which has high computational complexity. Recently, the K-means clustering (KM) algorithm has attracted a lot of attention due to its simplicity, effectiveness [4] and mathematical tractability [5].
∗ Corresponding author. Email address: [email protected] (Feiping Nie)
Preprint submitted to Pattern Recognition, January 21, 2020
In many applications, data sets exhibit duality. For instance, in document data, documents are clustered according to their relationships with different word clusters, while words are clustered according to their connections with distinct document clusters. Traditional clustering methods cluster samples according to the distribution of features or, conversely, cluster features according to the distribution of samples. Thus, they perform only one-sided clustering and ignore the duality information [6]. In addition, data sets are often described by multiple views, so we focus on multi-view clustering that exploits heterogeneous features such as HOG [7], LBP [8], CENT [9], SIFT [10] and GIST [11]. In a multi-view data set, each view has its own specific properties, and combining the heterogeneous features effectively can lead to better clustering performance. One conventional multi-view clustering approach directly concatenates the features from different views into one long feature vector and then clusters it with a single-view clustering method. However, direct concatenation may give rise to over-fitting, since the feature dimension can greatly exceed the number of samples. Recently proposed multi-view clustering methods [12] can be roughly divided into three categories: 1) common feature subspace methods; 2) multi-view spectral clustering (MVSC); 3) multi-view K-means (MVKM) [13].
The first category of methods searches for a common low-dimensional feature subspace [14] and then projects the different views into the subspace
where we can conduct the subsequent clustering task.
The second kind of approach applies spectral clustering to the multi-view setting. Termed multi-view spectral clustering, it exploits the geometrical structure of multi-view data by constructing a similarity matrix and then clusters the samples with existing clustering methods. For example, the co-regularized multi-view spectral clustering scheme in [15] makes the different graphs agree with each other. The method in [16] is based on co-training: it searches for clusters that agree across the views. In [17], an auto-weighted multi-view learning method is proposed that learns the relationships and complex structures of the data from the constructed graph; it learns the similarity graph and partitions the graph into specific clusters simultaneously. In [18], Nie et al. reformulated the standard spectral clustering algorithm and applied it to multi-view clustering and semi-supervised classification tasks. In [19], in order to cluster multi-view data while learning the weight of each view, Nie et al. proposed a self-weighted multi-view clustering method based on a Laplacian rank constrained graph. Because real data exhibit considerable noise, Ren et al. [20] proposed a robust auto-weighted multi-view clustering approach that makes the learned graph approximate the original graph of each view and employs the $\ell_{2,1}$ norm for robustness. Moreover, considering the nonlinear relationships in real-world data sets, Huang et al. [21] proposed an auto-weighted Multi-view Clustering with Multiple Kernels (MVCMK) algorithm. Although graph-based multi-view clustering methods are powerful and can exploit the complementary information across views, constructing multiple similarity graphs is computationally expensive. Therefore, multi-view spectral clustering approaches are not suitable for large-scale multi-view data sets.
The third type of algorithm is based on matrix factorization, which is equivalent to relaxed K-means [22]. These approaches improve computational efficiency because they do not need to build similarity graphs. To maintain consistently good clustering performance, Cai et al. [23] proposed a Robust Large-scale Multi-view K-means Clustering (RMVKM) method, which reduces time cost by avoiding both the construction of similarity graphs and the eigenvalue decomposition when solving the corresponding objective function. To remain robust to noise and outliers, Huang [24] proposed a robust Capped-Norm Multi-view Clustering (CaMVC) approach. Within matrix factorization, low-rank approximation is a representative technique. In particular, kernel-based low-rank approximation is widely used in fields such as multi-view clustering, classification and regression. For example, Rakotomamonjy et al. [25] used low-rank approximation based on multiple kernels to address the problem of large-scale learning. To eliminate users' vocal rating, Guan et al. [26] proposed a sample compression method via multiple kernel learning approximation. Inspired by kernel-based multi-view clustering, Xu et al. [27] proposed a Weighted Multi-view Clustering with Feature Selection (WMCFS) algorithm, which performs multi-view clustering and feature selection simultaneously.
Furthermore, to take full advantage of the duality in which samples induce feature clustering while features induce sample clustering, some researchers have proposed co-clustering algorithms [28, 29], which cluster the samples and features simultaneously. Co-clustering algorithms are suitable for data sets in which row clustering and column clustering are mutually dependent, as well as for sparse data sets. Co-clustering, also called bi-clustering, mainly comprises two models: the checkerboard co-cluster structure and the diagonal structure. On the one hand, the Bipartite Spectral Graph Partitioning (BSGP) method proposed by Dhillon et al. [30] for co-clustering words and documents is the best-known diagonal co-clustering model. Nevertheless, this kind of method performs poorly on multipartitioning problems and is prohibitive for large-scale data because of the singular value decomposition in its solving process. Thus, many methods [31] for the multipartitioning problem have been proposed. On the other hand, the Fast Nonnegative Matrix Tri-factorization (FNMTF) algorithm [32] is the representative approach for the checkerboard co-clustering structure. Because matrix decomposition involves uncertainty, Han et al. [33] proposed a special non-negative matrix factorization (NMF) algorithm, termed Bilateral K-means (BKM), by solving the relaxed minimum normalized cut problem. However, the above-mentioned algorithms are designed for single-view data clustering. Directly extending a single-view clustering algorithm to multi-view data is not reasonable, because each view has a different importance for the clustering task.
In this paper, the proposed multi-view clustering approaches are based on matrix factorization. The highlights and main contributions of this paper are summarized
as follows:
1) Unlike existing multi-view clustering methods, the proposed approaches impose indicator-matrix constraints in the matrix decomposition. Owing to these constraints, the clustering results of samples and features are obtained directly. Thus, the proposed methods are highly efficient for clustering multi-view data sets.
2) According to the importance of each view for the clustering task, the proposed approaches automatically learn the weight factors in a re-weighted manner. Moreover, the proposed methods are parameter-free, which makes them more practical.
3) Compared with graph-based multi-view clustering algorithms, the computational complexity of the proposed methods is the same as that of the traditional K-means algorithm, because they do not require eigenvalue decomposition in the solving process, which is a heavy computational burden for multi-view data sets.
The rest of this paper is organized as follows. Section 2 reviews the classical co-clustering algorithms. Section 3 presents the proposed FMVBKM method in detail. Section 4 presents a tractable optimization method. The proposed FMVMTF approach and its solving process are given in Section 5. The clustering results and their analysis on several data sets are reported in Section 6. Finally, the conclusion is presented in Section 7.
Notations:
Before continuing, we briefly introduce the notations used in this paper. For a data matrix $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of samples and $d$ is the feature dimension, we denote its $(i,j)$-th entry by $x_{ij}$. For multi-view data sets, let $X^v \in \mathbb{R}^{n \times m_v}$ or $X_v \in \mathbb{R}^{n \times m_v}$ denote the data matrix of the $v$-th view, where $m_v$ is the feature dimension of the $v$-th view; its $(i,j)$-th element is denoted by $X^v_{ij}$, and its $i$-th row and $j$-th column are denoted by $x^v_{i\cdot}$ and $x^v_{\cdot j}$, respectively. $\Phi$ denotes the set of indicator matrices, which store the clustering results of samples and features. Each row of an indicator matrix has exactly one element equal to 1, and the remaining elements are 0. For the $v$-th view data matrix $X_v$, we denote its Frobenius norm by $\|X_v\|_F = (\mathrm{Tr}(X_v X_v^T))^{1/2}$, where $X_v^T$ is the transpose of $X_v$ and $\mathrm{Tr}$ denotes the trace of a matrix. In addition, the vector of diagonal elements of a matrix $S_v$ is denoted by $s_v$.

2. Related work
In this section, we first revisit the Bipartite Spectral Graph Partitioning (BSGP) approach, which is highly effective for co-clustering, and then review the Bilateral K-means (BKM) algorithm. Finally, the bilateral clustering technique is extended to the multi-view clustering case.

2.1. Revisit BSGP
Let $X \in \mathbb{R}^{n \times d}$ be a data matrix. In order to make sample clustering induce feature clustering while feature clustering induces sample clustering [30], a bipartite graph $G = (V, A)$ is constructed between the samples and the features. The bipartite graph $G$ can also be seen as an undirected weighted graph, where $V$ is the node set of size $(n + d)$ and $A$ is the adjacency matrix

$$A = \begin{bmatrix} 0 & X \\ X^T & 0 \end{bmatrix}. \tag{1}$$

The purpose of BSGP is to find the minimal normalized cut on the graph $G$, which is defined by the following problem:

$$\min_{y} \frac{y^T L y}{y^T D y} \quad \text{s.t. } y \in \{-1, 1\}^{(n+d) \times 1}, \tag{2}$$

where $L = D - A \in \mathbb{R}^{(n+d) \times (n+d)}$ is the Laplacian matrix and $D \in \mathbb{R}^{(n+d) \times (n+d)}$ is the diagonal degree matrix with $d_{ii} = \sum_j a_{ij}$. Here $y = [p^T, q^T]^T$, where $p \in \{-1, 1\}^{n \times 1}$ denotes the clustering results of the samples and $q \in \{-1, 1\}^{d \times 1}$ denotes the clustering results of the features. $p_i = 1$ (respectively $q_i = 1$) denotes that the $i$-th sample (respectively feature) belongs to the first cluster; otherwise, it belongs to the second cluster.
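The quantities in Eq. (1) and Eq. (2) can be made concrete with a small NumPy sketch. It is only an illustration: the function name `bipartite_ncut` and the toy matrix are assumptions introduced here, and the code merely evaluates the ratio in Eq. (2) for a given labelling rather than solving the minimization.

```python
import numpy as np

def bipartite_ncut(X, p, q):
    """Evaluate y^T L y / y^T D y (Eq. (2)) for the bipartite graph of X."""
    n, d = X.shape
    # Adjacency matrix of Eq. (1): samples first, then features.
    A = np.block([[np.zeros((n, n)), X],
                  [X.T, np.zeros((d, d))]])
    D = np.diag(A.sum(axis=1))          # degree matrix, d_ii = sum_j a_ij
    L = D - A                           # graph Laplacian
    y = np.concatenate([p, q]).astype(float)
    return float(y @ L @ y) / float(y @ D @ y)

# Hypothetical toy data: samples 0-1 use features 0-1, sample 2 uses feature 2.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Labelling that follows the block structure: no edge is cut.
good = bipartite_ncut(X, p=np.array([1, 1, -1]), q=np.array([1, 1, -1]))
# Labelling that splits the first block: two edges are cut.
bad = bipartite_ncut(X, p=np.array([1, -1, 1]), q=np.array([1, -1, 1]))
```

In this toy case the labelling that respects the sample/feature blocks gives a ratio of zero, while the labelling that splits a block gives a strictly larger value, which is the behaviour the normalized cut is meant to penalize.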
2.2. Revisit BKM
Suppose the graph $G$ includes $c$ components. Applying the multipartitioning normalized cuts (Ncuts) to BSGP yields the following optimization problem [33]:

$$\min_{Y} \sum_{k=1}^{c} \frac{y_k^T L y_k}{y_k^T D y_k} \quad \text{s.t. } Y \in \Phi^{(n+d) \times c}, \tag{3}$$

where $y_k = [p_k^T, q_k^T]^T$. $p_{ik} = 1$ denotes that the $i$-th sample belongs to the $k$-th cluster, and $q_{ik} = 1$ denotes that the $i$-th feature belongs to the $k$-th cluster. Since $y_k^T D y_k$ is the $(k,k)$-th diagonal element of the matrix $Y^T D Y$, problem (3) can be written as

$$\min_{Y} \mathrm{Tr}\left(Y^T L Y (Y^T D Y)^{-1}\right) \quad \text{s.t. } Y \in \Phi^{(n+d) \times c}, \tag{4}$$

where $\mathrm{Tr}(\cdot)$ denotes the matrix trace.
Suppose $I$ is an identity matrix, $L = D - A$ and $Y^T = [P^T, Q^T]$, where $P$ and $Q$ are the clustering results of the samples and the features, respectively. Substituting $L$ into Eq. (4), it can be rewritten as

$$\begin{aligned}
& \min_{Y} \mathrm{Tr}\left(Y^T L Y (Y^T D Y)^{-1}\right) \\
={} & \min_{Y} \mathrm{Tr}\left(Y^T (D - A) Y (Y^T D Y)^{-1}\right) \\
={} & \min_{Y} \mathrm{Tr}\left(I - Y^T A Y (Y^T D Y)^{-1}\right).
\end{aligned} \tag{5}$$

According to the definitions of $A$ and $Y$, the above problem can be further rewritten as

$$\min_{P,Q} \mathrm{Tr}\left(-2 P^T X Q (Y^T D Y)^{-1}\right) \quad \text{s.t. } P \in \Phi^{n \times c},\ Q \in \Phi^{d \times c}. \tag{6}$$
Since the above problem is NP-complete [34], we relax it into a matrix decomposition problem by adding the two terms $\mathrm{Tr}((Y^T D Y)^{-1} P^T P (Y^T D Y)^{-1} Q^T Q)$ and $\mathrm{Tr}(X^T X)$:

$$\begin{aligned}
& \min_{P,Q} \mathrm{Tr}\left(-2 P^T X Q (Y^T D Y)^{-1}\right) + \mathrm{Tr}(X^T X) + \mathrm{Tr}\left((Y^T D Y)^{-1} P^T P (Y^T D Y)^{-1} Q^T Q\right) \\
={} & \min_{P,Q} \|X - P (Y^T D Y)^{-1} Q^T\|_F^2 \\
& \text{s.t. } P \in \Phi^{n \times c},\ Q \in \Phi^{d \times c}.
\end{aligned} \tag{7}$$

Note that the matrix $(Y^T D Y)^{-1}$ is diagonal; hence, the BKM algorithm can be written as

$$\min_{P,Q,S} \|X - P S Q^T\|_F^2 \quad \text{s.t. } P \in \Phi^{n \times c},\ S \in \mathrm{Diag},\ Q \in \Phi^{d \times c}, \tag{8}$$

where $\mathrm{Diag}$ is the set of diagonal matrices and the matrix $S$ provides degrees of freedom between $P$ and $Q$.

2.3. Extend to Multi-view Clustering
For multi-view data sets, let the data matrix $X_v \in \mathbb{R}^{n \times m_v}$ denote the features of the $v$-th view, let $F \in \mathbb{R}^{n \times c}$ be the indicator matrix that stores the clustering results of the samples, and let the indicator matrix $G_v \in \mathbb{R}^{m_v \times c}$ represent the clustering results of the features of the $v$-th view. The multi-view clustering problem can then be written as

$$\min_{F, S_v, G_v} \sum_{v=1}^{M} \|X_v - F S_v G_v^T\|_F^2 \quad \text{s.t. } F \in \Phi^{n \times c},\ S_v \in \mathrm{Diag},\ G_v \in \Phi^{m_v \times c}, \tag{9}$$

where $S_v$ is a diagonal matrix and $M$ is the number of views. In the next section, we show that this direct multi-view extension is not optimal, and we therefore propose a more reasonable approach.
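To make the role of the indicator matrices in Eq. (9) tangible, the following NumPy sketch builds $F$ and $G_v$ from label vectors and evaluates the objective. The helper names `indicator` and `multiview_objective` and the toy data are assumptions for illustration only; nothing here performs the optimization.

```python
import numpy as np

def indicator(labels, c):
    """Indicator matrix in Phi: row i has a single 1 in column labels[i]."""
    Z = np.zeros((len(labels), c))
    Z[np.arange(len(labels)), labels] = 1.0
    return Z

def multiview_objective(Xs, F, Ss, Gs):
    """Sum over views of ||X_v - F S_v G_v^T||_F^2, as in Eq. (9)."""
    return sum(np.linalg.norm(X - F @ S @ G.T, "fro") ** 2
               for X, S, G in zip(Xs, Ss, Gs))

# Hypothetical two-view toy problem with 3 samples and c = 2 clusters.
F = indicator(np.array([0, 0, 1]), c=2)        # shared sample clusters
G1 = indicator(np.array([0, 1]), c=2)          # feature clusters, view 1
G2 = indicator(np.array([0, 1, 1]), c=2)       # feature clusters, view 2
S1, S2 = np.diag([2.0, 5.0]), np.diag([1.0, 3.0])
X1, X2 = F @ S1 @ G1.T, F @ S2 @ G2.T          # data generated exactly by the model

obj_exact = multiview_objective([X1, X2], F, [S1, S2], [G1, G2])

# Moving sample 1 to the wrong cluster increases the objective.
F_wrong = indicator(np.array([0, 1, 1]), c=2)
obj_wrong = multiview_objective([X1, X2], F_wrong, [S1, S2], [G1, G2])
```

Because every view shares the same $F$, a single misassigned sample is penalized in all views at once, which is what couples the views in Eq. (9).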
3. The Proposed Fast Multi-view Bilateral K-means Method
In this section, we first analyze the deficiencies of directly applying the BKM algorithm to multi-view clustering, which motivates our auto-weighted scheme. Then, we propose a novel adaptively weighted BKM technique for multi-view clustering.

3.1. Motivation
One traditional multi-view clustering method concatenates all features into a single vector in order to use all views. However, this is not optimal, because important and less important views receive equal weight [23]. Eq. (9) likewise ignores the differences in clustering capacity between views. Ideally, each view should be weighted according to its importance to the clustering task. A straightforward idea is therefore to replace the above problem with

$$\min_{F, S_v, G_v} \sum_{v=1}^{M} \frac{1}{\alpha_v} \|X_v - F S_v G_v^T\|_F^2 \quad \text{s.t. } F \in \Phi^{n \times c},\ S_v \in \mathrm{Diag},\ G_v \in \Phi^{m_v \times c}, \tag{10}$$

where the constant $\alpha_v$ measures the clustering capacity of each view individually:

$$\alpha_v \triangleq \min_{F, S_v, G_v} \|X_v - F S_v G_v^T\|_F. \tag{11}$$
3.2. Adaptively Weighted Bilateral K-means
In the above problem, $\alpha_v$ measures the clustering capacity of the $v$-th view and yields a weighted BKM algorithm: the higher the clustering capacity of a view, the larger its weight, and vice versa. However, this is not reasonable when the $v$-th view disagrees significantly with the other views. For example, such a view does not deserve the large weight $1/\alpha_v$ that it receives when $\alpha_v$ is small [35]. To solve this problem, we use a unified evaluation instead of evaluating the clustering capacity of each view individually:

$$(F^*, S_v^*, G_v^*) = \arg\min_{F, S_v, G_v} \sum_{v=1}^{M} \frac{1}{\alpha_v} \|X_v - F S_v G_v^T\|_F^2 \quad \text{s.t. } F \in \Phi^{n \times c},\ S_v \in \mathrm{Diag},\ G_v \in \Phi^{m_v \times c}, \tag{12}$$

where

$$\alpha_v \triangleq \|X_v - F^* S_v^* (G_v^*)^T\|_F. \tag{13}$$

The above problem is equivalent to the following problem:

$$\min_{F, S_v, G_v} \sum_{v=1}^{M} \|X_v - F S_v G_v^T\|_F \quad \text{s.t. } F \in \Phi^{n \times c},\ S_v \in \mathrm{Diag},\ G_v \in \Phi^{m_v \times c}. \tag{14}$$

This is the proposed multi-view clustering method, termed the Fast Multi-view Bilateral K-means (FMVBKM) algorithm.

4. Optimization Algorithm of Solving Problem (14)
Before giving the solution of the FMVBKM algorithm, we present a useful definition and proposition.
Definition 1. Suppose $B \in \mathbb{R}^{k \times k}$ and its $(i,i)$-th element is $b_{ii}$. The function $f(B)$ is defined as $f(B) = [b_{11}, b_{22}, \ldots, b_{kk}]^T = b$.
Proposition 1. If $C \in \mathbb{R}^{k \times k}$ is a diagonal matrix, then for an arbitrary matrix $B \in \mathbb{R}^{k \times k}$, $\mathrm{Tr}(BC) = f(B)^T f(C)$.
Proof: Since $C$ is a diagonal matrix, $\mathrm{Tr}(BC) = \sum_{i=1}^{k} b_{ii} c_{ii} = [b_{11}, b_{22}, \ldots, b_{kk}] [c_{11}, c_{22}, \ldots, c_{kk}]^T = f(B)^T f(C)$.
The difficulty of solving Eq. (14) comes from the requirement that each row vector of the indicator matrices $F$ and $G_v$ must satisfy the 1-of-$c$ coding scheme. Eq. (14) is equivalent to the following problem (15), which not only learns the weights adaptively but also takes full advantage of the relationships among the multi-view data:

$$\min_{F, G_v, S_v} \sum_{v=1}^{M} d_v \|X_v - F S_v G_v^T\|_F^2 \quad \text{s.t. } F \in \Phi^{n \times c},\ S_v \in \mathrm{Diag},\ G_v \in \Phi^{m_v \times c}, \tag{15}$$

where

$$d_v = \frac{1}{2 \|X_v - F S_v G_v^T\|_F}$$

is the weight of the $v$-th view.
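The link between Eq. (14) and Eq. (15) can be made explicit with a short derivation. The following is a sketch, assuming every residual $E_v = X_v - F S_v G_v^T$ is nonzero and treating $\theta$ as any continuous variable of the problem (for instance an entry of $S_v$); the symbols $E_v$ and $\theta$ are introduced here only for illustration.

```latex
\begin{align*}
\frac{\partial}{\partial \theta} \sum_{v=1}^{M} \|E_v\|_F
  &= \sum_{v=1}^{M} \frac{1}{2\|E_v\|_F}\,
     \frac{\partial \|E_v\|_F^2}{\partial \theta}
   = \sum_{v=1}^{M} d_v\, \frac{\partial \|E_v\|_F^2}{\partial \theta},
\qquad d_v = \frac{1}{2\|E_v\|_F}.
\end{align*}
```

Hence, with $d_v$ held at its current value, a stationary point of Eq. (14) is also stationary for Eq. (15), which is why the weights can simply be recomputed from the residuals after each round of updates.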
To solve Eq. (15), we rewrite it as

$$\min_{F, G_v, S_v, d_v} \sum_{v=1}^{M} d_v \mathrm{Tr}\{(X_v - F S_v G_v^T)^T (X_v - F S_v G_v^T)\} \quad \text{s.t. } F \in \Phi^{n \times c},\ S_v \in \mathrm{Diag},\ G_v \in \Phi^{m_v \times c}. \tag{16}$$

We then separate the above problem into several subproblems and adopt an alternating iterative strategy to solve them.
1) Step 1. Solving $S_v$ when $F$, $G_v$ and $d_v$ are fixed. Using the properties of the matrix trace, the objective function of Eq. (16) can be rewritten as

$$\begin{aligned}
& \min_{S_v \in \mathrm{Diag}} \sum_{v=1}^{M} d_v \mathrm{Tr}\{(X_v - F S_v G_v^T)^T (X_v - F S_v G_v^T)\} \\
={} & \min_{S_v \in \mathrm{Diag}} \sum_{v=1}^{M} d_v \{\mathrm{Tr}(X_v^T X_v) - 2 \mathrm{Tr}(G_v^T X_v^T F S_v) + \mathrm{Tr}(S_v^T F^T F S_v G_v^T G_v)\}.
\end{aligned} \tag{17}$$

Since $F$ and $G_v$ are indicator matrices, $F^T F$ and $G_v^T G_v$ are diagonal matrices. According to Proposition 1,

$$\begin{aligned}
\mathrm{Tr}(S_v^T F^T F S_v G_v^T G_v) &= \mathrm{Tr}(S_v^T F^T F G_v^T G_v S_v) \\
&= (f(S_v^T))^T F^T F G_v^T G_v f(S_v).
\end{aligned} \tag{18}$$

Similarly, $\mathrm{Tr}(G_v^T X_v^T F S_v) = (f(G_v^T X_v^T F))^T f(S_v)$. Let $W_v$ denote $F^T F G_v^T G_v$, let $s_v$ denote $f(S_v)$ and let $h_v$ denote $f(G_v^T X_v^T F)$. The objective function in Eq. (17) can then be rewritten as

$$J = \min_{s_v} \sum_{v=1}^{M} d_v \{\mathrm{Tr}(X_v^T X_v) - 2 h_v^T s_v + s_v^T W_v s_v\}. \tag{19}$$

Solving for $S_v$ thus reduces to solving for $s_v$. Taking the derivative of Eq. (19) with respect to $s_v$ and setting it to 0, we have

$$\frac{\partial J}{\partial s_v} = 2 d_v (W_v s_v - h_v) = 0. \tag{20}$$

Since $d_v > 0$, the solution of problem (17) is

$$s_v = W_v^{-1} h_v, \tag{21}$$

where $W_v$ is a diagonal matrix, so $W_v^{-1}$ is easily obtained.
2) Step 2. Solving $G_v$ and $F$ when $S_v$ and $d_v$ are fixed. In order to solve $F$, we have

$$\begin{aligned}
& \min_{F} \sum_{v=1}^{M} d_v \|X_v - F S_v G_v^T\|_F^2 \\
={} & \min_{F} \sum_{v=1}^{M} \sum_{i=1}^{n} d_v \|x^v_{i\cdot} - f_{i\cdot} S_v G_v^T\|_2^2 \\
={} & \min_{F} \sum_{i=1}^{n} \Big( \sum_{v=1}^{M} d_v \|x^v_{i\cdot} - f_{i\cdot} S_v G_v^T\|_2^2 \Big).
\end{aligned} \tag{22}$$

The above problem can be solved by decoupling the data and assigning the cluster indicators one by one, independently. That is, the optimization problem decomposes into $n$ simple subproblems. For each $i$ $(1 \le i \le n)$, we have

$$\min_{f_{i\cdot}} \sum_{v=1}^{M} d_v \|x^v_{i\cdot} - f_{i\cdot} S_v G_v^T\|_2^2. \tag{23}$$

In the vector $f_{i\cdot}$, exactly one element equals 1 and the remaining elements are zeros. Hence, the solution is

$$f_{ij} = \begin{cases} 1, & j = \arg\min_{k} \sum_{v=1}^{M} d_v \|x^v_{i\cdot} - r^v_{k\cdot}\|_2^2, \\ 0, & \text{otherwise,} \end{cases} \tag{24}$$

where $R_v = S_v G_v^T$ and $r^v_{k\cdot}$ is the $k$-th $(k = 1, 2, \ldots, c)$ row of $R_v$.
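Similarly, in order to solve $G_v$, we have

$$\begin{aligned}
& \min_{G_v} \sum_{v=1}^{M} d_v \|X_v - F S_v G_v^T\|_F^2 \\
={} & \min_{G_v} \sum_{v=1}^{M} \sum_{j=1}^{m_v} d_v \|x^v_{\cdot j} - F S_v (g^v_{j\cdot})^T\|_2^2.
\end{aligned} \tag{25}$$

The above optimization problem can be decomposed into several simple subproblems. For each $j$ $(1 \le j \le m_v)$, we have

$$\min_{g^v_{j\cdot}} d_v \|x^v_{\cdot j} - P_v (g^v_{j\cdot})^T\|_2^2, \tag{26}$$

where $P_v = F S_v$. Therefore, the solution of Eq. (26) is

$$g^v_{jk} = \begin{cases} 1, & k = \arg\min_{l} d_v \|x^v_{\cdot j} - p^v_{\cdot l}\|_2^2, \\ 0, & \text{otherwise,} \end{cases} \tag{27}$$

where $p^v_{\cdot l}$ represents the $l$-th column of $P_v$.
3) Step 3. Solving $d_v$ when $F$, $S_v$ and $G_v$ are fixed. The weight of each view is updated by its definition, $d_v = 1 / (2 \|X_v - F S_v G_v^T\|_F)$.
The procedure for solving Eq. (14) is summarized in Algorithm 1.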
Algorithm 1 Solving FMVBKM via re-weighted method.
Input: Data matrices $\{X_1, X_2, \ldots, X_M\}$ with $X_v \in \mathbb{R}^{n \times m_v}$; the maximum number of iterations $t = 30$.
Initialization: Initialize $F \in \Phi^{n \times c}$ and $G_v \in \Phi^{m_v \times c}$ with arbitrary class indicator matrices; set $d_v = \frac{1}{M}$ for $v = 1, 2, \ldots, M$.
for iter = 1 : t do
  - Update $S_v$ by Eq. (21);
  - Update $F$ by Eq. (24);
  - Update $G_v$ by Eq. (27);
  - Update the $v$-th view weight $d_v$ by $d_v = \frac{1}{2 \|X_v - F S_v G_v^T\|_F}$;
  - Check convergence.
end for
Output: Indicator matrix $F$ for the clustering results of the samples and $G_v$ for the clustering results of the features of the $v$-th view.
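The loop of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random initialization, the handling of empty clusters (their entry of $s_v$ is set to 0), the small constant `eps` guarding against zero residuals, and the fixed iteration count in place of a convergence check are all choices made here. Pairwise distances are formed densely, which is only suitable for small toy data.

```python
import numpy as np

def indicator(labels, c):
    """Indicator matrix: row i has a single 1 in column labels[i]."""
    Z = np.zeros((len(labels), c))
    Z[np.arange(len(labels)), labels] = 1.0
    return Z

def sq_dists(A, B):
    """Squared Euclidean distances between every row of A and every row of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def fmvbkm(Xs, c, n_iter=30, seed=0, eps=1e-12):
    rng = np.random.default_rng(seed)
    n, M = Xs[0].shape[0], len(Xs)
    f = rng.integers(c, size=n)                           # sample labels (rows of F)
    gs = [rng.integers(c, size=X.shape[1]) for X in Xs]   # feature labels per view
    d = np.full(M, 1.0 / M)                               # initial view weights
    history = []
    for _ in range(n_iter):
        F = indicator(f, c)
        # Step 1, Eq. (21): s_v = W_v^{-1} h_v with diagonal W_v.
        Ss = []
        for X, g in zip(Xs, gs):
            G = indicator(g, c)
            w = F.sum(axis=0) * G.sum(axis=0)             # diagonal of F^T F G_v^T G_v
            h = np.diag(G.T @ X.T @ F)                    # h_v = f(G_v^T X_v^T F)
            Ss.append(np.diag(np.where(w > 0, h / np.maximum(w, eps), 0.0)))
        # Step 2a, Eq. (24): assign each sample to the closest row of R_v = S_v G_v^T.
        cost = sum(dv * sq_dists(X, S @ indicator(g, c).T)
                   for dv, X, S, g in zip(d, Xs, Ss, gs))
        f = cost.argmin(axis=1)
        F = indicator(f, c)
        # Step 2b, Eq. (27): assign each feature to the closest column of P_v = F S_v.
        gs = [sq_dists(X.T, (F @ S).T).argmin(axis=1) for X, S in zip(Xs, Ss)]
        # Step 3: d_v = 1 / (2 ||X_v - F S_v G_v^T||_F).
        res = np.array([np.linalg.norm(X - F @ S @ indicator(g, c).T)
                        for X, S, g in zip(Xs, Ss, gs)])
        d = 1.0 / (2.0 * np.maximum(res, eps))
        history.append(float(res.sum()))                  # objective of Eq. (14)
    return f, gs, d, history

# Hypothetical toy data: 30 samples, two noisy views with planted co-clusters.
rng = np.random.default_rng(1)
F_true = indicator(np.repeat([0, 1, 2], 10), 3)
Xs = []
for m in (6, 9):
    G_true = indicator(np.arange(m) % 3, 3)
    Xs.append(F_true @ np.diag([1.0, 4.0, 9.0]) @ G_true.T
              + 0.05 * rng.standard_normal((30, m)))

f, gs, d, history = fmvbkm(Xs, c=3, n_iter=10)
```

The list `history` records the value of the objective in Eq. (14) after each iteration, so it can be inspected to see how quickly the alternating updates settle. Like K-means, the result depends on the random initialization, so in practice one would typically run several seeds.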
5. The Proposed Fast Multi-view Matrix Tri-Factorization (FMVMTF) Approach and the Optimization Algorithm

5.1. The Proposed FMVMTF Approach
Eq. (14) can readily be seen as a low-rank matrix approximation problem for multi-view data, in which the matrix $F$ gives the clustering results of the samples and the matrix $G_v$ gives the clustering results of the features of the $v$-th view. Although Eq. (14) is elegant, the diagonal constraint on $S_v$, which connects $F$ and $G_v$, is restrictive: it results in information loss and may not yield a good low-rank approximation. To tackle this difficulty, the diagonal constraint is relaxed. In addition, since each view has a different importance for the clustering results, the ideal view weights are allocated automatically without an explicit weight definition. To improve both clustering accuracy and computational efficiency, we minimize the following objective function:

$$\min_{F, S_v, G_v} \sum_{v=1}^{M} \|X_v - F S_v G_v^T\|_F \quad \text{s.t. } F \in \Phi^{n \times c},\ G_v \in \Phi^{m_v \times c}, \tag{28}$$
which is the proposed method, termed the Fast Multi-view Matrix Tri-Factorization (FMVMTF) approach.

5.2. Optimization Algorithm of Solving Problem (28)
Again, we use the re-weighted method to solve the above problem. Using the same trick, Eq. (28) can be transformed into the following problem:

$$J = \min_{F, G_v, S_v, d_v} \sum_{v=1}^{M} d_v \mathrm{Tr}\{(X_v - F S_v G_v^T)^T (X_v - F S_v G_v^T)\} \quad \text{s.t. } F \in \Phi^{n \times c},\ G_v \in \Phi^{m_v \times c}, \tag{29}$$

where $d_v$ is the weight of the $v$-th view, defined as

$$d_v = \frac{1}{2 \|X_v - F S_v G_v^T\|_F}. \tag{30}$$

1) Step 1. Solving $S_v$ when $F$, $G_v$ and $d_v$ are fixed. Calculating the partial derivative of the objective in Eq. (29) with respect to $S_v$ and setting it to 0, we have

$$S_v = (F^T F)^{-1} (F^T X_v G_v) (G_v^T G_v)^{-1}. \tag{31}$$
2) Step 2. With $S_v$ and $d_v$ fixed, the problem of updating $F$ and $G_v$ is similar to Eq. (24) and Eq. (27), respectively. Let $R_v = S_v G_v^T$ and $P_v = F S_v$, and let $r^v_{k\cdot}$ and $p^v_{\cdot k}$ be the $k$-th $(k = 1, 2, \ldots, c)$ row of $R_v$ and column of $P_v$, respectively. The solution is then obtained by

$$f_{ij} = \begin{cases} 1, & j = \arg\min_{k} \sum_{v=1}^{M} d_v \|x^v_{i\cdot} - r^v_{k\cdot}\|_2^2, \\ 0, & \text{otherwise,} \end{cases} \tag{32}$$

$$g^v_{jk} = \begin{cases} 1, & k = \arg\min_{l} d_v \|x^v_{\cdot j} - p^v_{\cdot l}\|_2^2, \\ 0, & \text{otherwise.} \end{cases} \tag{33}$$

3) Step 3. With $F$, $S_v$ and $G_v$ fixed, update $d_v$ by Eq. (30).
Similarly, the procedure for solving Eq. (28) is summarized in Algorithm 2.
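The closed-form update of Eq. (31) has a simple reading when $F$ and $G_v$ are indicator matrices: $F^T F$ and $G_v^T G_v$ are diagonal matrices of cluster sizes, so each entry of $S_v$ is the mean of one block of $X_v$. The NumPy sketch below illustrates this; the function name `update_S_full`, the `eps` guard for empty clusters and the toy matrix are assumptions introduced here.

```python
import numpy as np

def indicator(labels, c):
    """Indicator matrix: row i has a single 1 in column labels[i]."""
    Z = np.zeros((len(labels), c))
    Z[np.arange(len(labels)), labels] = 1.0
    return Z

def update_S_full(X, F, G, eps=1e-12):
    """Eq. (31): S = (F^T F)^{-1} F^T X G (G^T G)^{-1} for indicator F and G."""
    n_k = np.maximum(F.sum(axis=0), eps)   # diagonal of F^T F (sample cluster sizes)
    m_l = np.maximum(G.sum(axis=0), eps)   # diagonal of G^T G (feature cluster sizes)
    # Dividing entry (k, l) by n_k * m_l applies both diagonal inverses at once.
    return (F.T @ X @ G) / np.outer(n_k, m_l)

# Hypothetical checkerboard data: four constant blocks.
X = np.array([[1.0, 1.0, 2.0, 2.0],
              [1.0, 1.0, 2.0, 2.0],
              [3.0, 3.0, 4.0, 4.0],
              [3.0, 3.0, 4.0, 4.0]])
F = indicator(np.array([0, 0, 1, 1]), 2)
G = indicator(np.array([0, 0, 1, 1]), 2)

S = update_S_full(X, F, G)                 # block means of X
residual_full = np.linalg.norm(X - F @ S @ G.T)
# Keeping only the diagonal of S, as in the FMVBKM constraint, loses the
# off-diagonal blocks of this example.
residual_diag = np.linalg.norm(X - F @ np.diag(np.diag(S)) @ G.T)
```

On checkerboard-structured data such as this toy matrix, the full $S_v$ reproduces the data exactly while a diagonal $S_v$ cannot, which is the information loss that motivates relaxing the diagonal constraint.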
Algorithm 2 Solving FMVMTF via re-weighted method.
Input: Data matrices $\{X_1, X_2, \ldots, X_M\}$ with $X_v \in \mathbb{R}^{n \times m_v}$; the maximum number of iterations $t = 30$.
Initialization: Initialize $F \in \Phi^{n \times c}$ and $G_v \in \Phi^{m_v \times c}$ with arbitrary class indicator matrices; set $d_v = \frac{1}{M}$ for $v = 1, 2, \ldots, M$.
for iter = 1 : t do
  - Update $S_v$ by Eq. (31);
  - Update $F$ by Eq. (32);
  - Update $G_v$ by Eq. (33);
  - Update the $v$-th view weight $d_v$ by Eq. (30);
  - Check convergence.
end for
Output: Indicator matrices $F$ and $G_v$.

5.3. Computational Complexity Analysis
In this section, we evaluate the time complexity of the two proposed methods. From Algorithm 1 and Algorithm 2, we can see that the overall complexity is determined by the cost of solving $F$, $S_v$ and $G_v$. Note that we only count multiplications, since they dominate the cost. Solving $S_v$ requires $O(c)$ and $O(n)$ multiplications in Algorithm 1 and Algorithm 2, respectively. Solving $F$ costs $O(Mnmct)$ in Algorithm 1 and $O(Mnmc^2 t)$ in Algorithm 2, where $n$ is the number of samples, $m = \max\{m_1, m_2, \ldots, m_M\}$ is the largest feature dimension over the views, $M$ is the number of views and $t$ is the number of iterations. Solving $G_v$ costs $O(nmct)$ and $O(nmc^2 t)$ multiplications in Algorithm 1 and Algorithm 2, respectively. The overall time cost is therefore approximately $O(c + Mnmct + nmct)$ for FMVBKM and $O(n + Mnmc^2 t + nmc^2 t)$ for FMVMTF. In many applications, $M \ll n$, $m \ll n$ and $c \ll n$, so the complexity of both proposed methods is $O(n)$.

5.4. Convergence Analysis
The alternating updates never increase the objectives of Eq. (14) and Eq. (28), so both algorithms converge; because the indicator constraints are discrete, the problems are not convex and the solution reached is in general a local one rather than a global one. Taking Algorithm 1 as an example, the convergence can be proved as follows. To prove the convergence, Lemma 1 is introduced.
Lemma 1 [36] Given two positive numbers $a$ and $b$, the following inequality holds:

$$a - \frac{a^2}{2b} \le b - \frac{b^2}{2b}. \tag{34}$$

Theorem 1 In Algorithm 1, the updated variables decrease the objective value in each iteration until the algorithm converges.
Proof: Let $\tilde{F}$, $\tilde{S}_v$ and $\tilde{G}_v$ denote the updated $F$, $S_v$ and $G_v$ in an iteration. Since $\tilde{F}$, $\tilde{S}_v$ and $\tilde{G}_v$ are the solution of the re-weighted problem for the current weights $d_v$, we get the following inequality:

$$\sum_{v=1}^{M} \frac{\|X_v - \tilde{F} \tilde{S}_v \tilde{G}_v^T\|_F^2}{2 \|X_v - F S_v G_v^T\|_F} \le \sum_{v=1}^{M} \frac{\|X_v - F S_v G_v^T\|_F^2}{2 \|X_v - F S_v G_v^T\|_F}. \tag{35}$$

In addition, according to Lemma 1, we obtain

$$\sum_{v=1}^{M} \Big( \|X_v - \tilde{F} \tilde{S}_v \tilde{G}_v^T\|_F - \frac{\|X_v - \tilde{F} \tilde{S}_v \tilde{G}_v^T\|_F^2}{2 \|X_v - F S_v G_v^T\|_F} \Big) \le \sum_{v=1}^{M} \Big( \|X_v - F S_v G_v^T\|_F - \frac{\|X_v - F S_v G_v^T\|_F^2}{2 \|X_v - F S_v G_v^T\|_F} \Big). \tag{36}$$

Summing the above two inequalities on both sides, we get

$$\sum_{v=1}^{M} \|X_v - \tilde{F} \tilde{S}_v \tilde{G}_v^T\|_F \le \sum_{v=1}^{M} \|X_v - F S_v G_v^T\|_F. \tag{37}$$

Since the objective function (14) is bounded below by 0, the optimization algorithm is convergent. Similarly, Algorithm 2 can also be proved to be convergent.
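Lemma 1 follows from a one-line argument, sketched here for completeness; it uses only that $a$ and $b$ are positive.

```latex
\begin{align*}
(a-b)^2 \ge 0
 \;&\Longrightarrow\; 2ab - a^2 \le b^2 \\
 \;&\Longrightarrow\; a - \frac{a^2}{2b} \le \frac{b}{2} = b - \frac{b^2}{2b}
 \qquad (\text{dividing by } 2b > 0).
\end{align*}
```

Taking $a = \|X_v - \tilde{F} \tilde{S}_v \tilde{G}_v^T\|_F$ and $b = \|X_v - F S_v G_v^T\|_F$ for each view and summing over $v$ gives Eq. (36).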
6. Experiments
In this section, we conduct several experiments on six benchmark data sets to evaluate the performance of the proposed methods, measured by three standard clustering evaluation metrics: Clustering Accuracy (ACC) [37], Normalized Mutual Information (NMI) [37] and Purity [38].
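For orientation, two of these metrics can be sketched in plain Python. The sketch assumes the commonly used definitions (purity as the fraction of samples covered by the majority class of each cluster, ACC as the best one-to-one matching between clusters and classes); the exact formulations in [37] and [38] may differ in detail. The matching is found by brute force over permutations, which is only practical for a handful of clusters, and NMI is omitted for brevity.

```python
from collections import Counter
from itertools import permutations

def purity(y_true, y_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    total = 0
    for k in set(y_pred):
        members = [t for t, p in zip(y_true, y_pred) if p == k]
        total += Counter(members).most_common(1)[0][1]
    return total / len(y_true)

def clustering_accuracy(y_true, y_pred):
    """Best one-to-one cluster-to-class matching (brute force, small c only)."""
    clusters = sorted(set(y_pred))
    classes = sorted(set(y_true))
    # Pad with dummy classes when there are more clusters than classes.
    classes += [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        best = max(best, hits)
    return best / len(y_true)

# Hypothetical labels: class 0 is split across two predicted clusters.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 2, 2]
pur = purity(y_true, y_pred)
acc = clustering_accuracy(y_true, y_pred)
```

In this example every cluster is pure, so purity is perfect, but only one of the two clusters covering class 0 can be matched to it, so ACC is lower. This is why the two metrics are usually reported together.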
6.1. Data Set Descriptions The detailed summarization of six data sets that we will use in our experiments are shown in Table 1. Caltech101 This collection consists of 8677 images belonging to 101 categories for object recognition problem. Following previous work [39], we chose the widely used
225
7 classes, i.e.Face, Motorbike, Dolla-Bill, Garfield, Snoopy, Stop-sign and Windsorchair. Totally, we get 1474 images by sampling the data, which is called Caltech1017 (Cal101-7). Moreover, we also chose a data which contains totally 2386 images belonging to 20 classes, i.e.Face, Motorbike, Dolla-Bill, Garfield, Snoopy, Leopards, Binocular, Brain, Camera, Car-Side, Ferry, Hedgehog, Pagoda, Rhino, Stapler, Stop-
230
Sign, Water-Lilly, Windsor-Chair, Wrench and Yin-yang. It is termed as Caltech10120 (Cal101-20). Six visual feature vectors are extracted from each image and contain: Gabor feature with dimension 48, Wavelet moments (WM) with dimension 40, CENTRIST feature with dimension 254, HOG feature with dimension 1984, GIST feature with dimension 512 and LBP feature with dimension 928.
235
Handwritten (HW) This set consists of 2000 patterns(200 data points per class from 0 to 9) and it is from UCI machine learning respository. Six different feature vectors of these digits are used and include 76 Fourier coefficients of the character shapes (FOU), 216 profile correlations (FAC), 64 Karhunen-Love coefficients (KAR), 240 pixel averages in 2 × 3 windows (PIX), 47 Zernike moment (ZER) and 6 morpho-
240
logical (MOR) features.
MNIST This data set of handwritten digits (0∼9) contains the 10000 test samples from Yann LeCun's MNIST page [40]. We use three heterogeneous feature vectors for all images: Isometric projection (ISO) with dimension 30, Linear discriminant analysis (LDA) with dimension 9 and Neighborhood preserving embedding (NPE) with dimension 30.
NUS-WIDE The full data set consists of 269,648 images annotated with 81 concepts [41]. We select 12 animal categories with 2400 data samples in total, namely cat, cow, dog, elk, hawk, horse, lion, squirrel, tiger, whales, wolf and zebra. Six types of low-level feature vectors are used for each image: a 64-dimensional color histogram (CH), a 144-dimensional color correlogram (CC), a 73-dimensional edge direction histogram (EDH), a 128-dimensional wavelet texture (WAV), a 255-dimensional block-wise color moment (BCM) and a 500-dimensional bag of words based on SIFT descriptors (SIFT).
Animal with attributes (AWA) The animal images cover 50 kinds of animals. For each class, we select the first 80 images, which gives 4000 images in total. Six different feature vectors are extracted to represent each image: Color Histogram (CH), Local Self-Similarity (LSS), Pyramid HOG (PHOG), SIFT, Color SIFT (RGSIFT) and SURF with dimensions 2688, 2000, 252, 2000, 2000 and 2000, respectively.

Table 1: Descriptions of data sets
Data set      View 1     View 2     View 3     View 4      View 5        View 6      Size        Classes
Cal101-7(20)  GABOR(48)  WM(40)     CENT(254)  HOG(1984)   GIST(512)     LBP(928)    1474(2386)  7(20)
HW            FOU(76)    FAC(216)   KAR(64)    PIX(240)    ZER(47)       MOR(6)      2000        10
MNIST         ISO(30)    LDA(9)     NPE(30)    -           -             -           10000       10
NUS           CH(64)     CC(144)    EDH(73)    WAV(128)    BCM(255)      SIFT(500)   2400        12
AWA           CQ(2688)   LSS(2000)  PHOG(252)  SIFT(2000)  RGSIFT(2000)  SURF(2000)  4000        50

6.2. Comparison Scheme
Firstly, to demonstrate the advantages of multi-view clustering methods over single-view approaches, we compare the two proposed approaches with the BKM algorithm [33] applied to each view. Secondly, we make comparisons with several state-of-the-art methods: Co-trained Spectral Clustering (ContrainSC) [16], Co-regularized Spectral Clustering (CoregSC) [15], Multi-view Spectral Clustering (MVSC) [42], Auto-weighted Multiple Graph Learning (AMGL) [18], Multi-view Learning with Adaptive
Neighbors (MLAN) [17], Self-weighted Multi-view Clustering (SwMC) [19] and Robust Multi-view K-means Clustering (RMVKM) [23].
In the co-clustering methods, we set the number of sample clusters to the true number of classes, and the number of feature clusters is set equal to it. Moreover, we run the compared approaches with the optimal parameter settings given in the corresponding papers. To make the comparison fair, all data sets are normalized to the range [0, 1] before any processing. In the experimental results, the best and second-best results are marked in bold and underline, respectively.

Table 2: Comparison of the proposed methods with BKM algorithm on Cal101-7 and Cal101-20 data sets
Data set   Metric  BKM(1)  BKM(2)  BKM(3)  BKM(4)  BKM(5)  BKM(6)  FMVBKM  FMVMTF
Cal101-7   ACC     0.4946  0.4498  0.5312  0.5638  0.5265  0.5746  0.7090  0.7368
Cal101-7   NMI     0.1384  0.1025  0.0147  0.0667  0.2725  0.0969  0.4865  0.6686
Cal101-7   Purity  0.6350  0.5753  0.5427  0.5645  0.7273  0.5746  0.8250  0.8426
Cal101-20  ACC     0.3466  0.3998  0.3407  0.3433  0.3437  0.3898  0.5754  0.6894
Cal101-20  NMI     0.1165  0.1196  0.0237  0.0209  0.0199  0.0366  0.4535  0.5785
Cal101-20  Purity  0.4292  0.4237  0.3407  0.3462  0.3453  0.8980  0.6500  0.6894

Table 3: Comparison of the proposed methods with BKM algorithm on HW and NUS data sets
Data set   Metric  BKM(1)  BKM(2)  BKM(3)  BKM(4)  BKM(5)  BKM(6)  FMVBKM  FMVMTF
HW         ACC     0.2433  0.2892  0.4825  0.4780  0.4255  0.3875  0.6038  0.9405
HW         NMI     0.0288  0.0744  0.3978  0.4625  0.3448  0.4547  0.6425  0.8754
HW         Purity  0.2458  0.2933  0.5030  0.5065  0.4255  0.3875  0.6165  0.9405
NUS        ACC     0.1792  0.1675  0.1400  0.1546  0.1717  0.1283  0.2529  0.2583
NUS        NMI     0.0507  0.0521  0.0441  0.0356  0.0519  0.0515  0.1022  0.1296
NUS        Purity  0.1858  0.1704  0.1492  0.1583  0.1750  0.1375  0.2529  0.2729

6.3. Experimental Results and Analysis
To verify the importance of coordination among multiple views, Table 2, Table 3 and Table 4 show the clustering performance of the BKM algorithm (for each view) and of the two proposed methods on the six data sets. From the comparison, we can see that multi-view clustering achieves better clustering results than single-view clustering since it considers the relationships between heterogeneous representations. Specifically, for multi-view data sets, each view may include
some different information.
For the Cal101-7, HW, NUS and MNIST data sets, we can observe that the proposed FMVMTF method shows the best clustering results and the proposed FMVBKM approach obtains the second-best clustering results. This is not surprising since the FMVMTF clustering approach makes the low-rank approximation more accurate by increasing the degrees of freedom. However, on the HW data set, FMVBKM performs only slightly better than single-view clustering because the relationship between the row clustering and the column clustering is weak. Moreover, it is worth noting that, although the FMVBKM approach shows the second-best clustering results on the MNIST data set, its performance is close to that of the single-view clustering method. Therefore, we can conclude that the FMVBKM approach is not well suited to digit data sets because of their lack of duality.

Table 4: Comparison of the proposed methods with BKM algorithm on AWA and MNIST data sets
Data set   Metric  BKM(1)  BKM(2)  BKM(3)  BKM(4)  BKM(5)  BKM(6)  FMVBKM  FMVMTF
AWA        ACC     0.0370  0.0445  0.0205  0.0418  0.0460  0.1985  0.1053  0.1218
AWA        NMI     0.0286  0.0324  0.0015  0.0324  0.0305  0.1773  0.1520  0.2015
AWA        Purity  0.0375  0.0445  0.0205  0.0418  0.0460  0.1985  0.1075  0.1278
MNIST      ACC     0.2926  0.4990  0.2978  -       -       -       0.5887  0.8023
MNIST      NMI     0.1854  0.3984  0.1745  -       -       -       0.4480  0.6586
MNIST      Purity  0.2970  0.4990  0.3081  -       -       -       0.5950  0.8023

Table 5, Table 6 and Table 7 show the clustering results of the two proposed approaches and the compared multi-view clustering algorithms. From the summarized clustering results, we can observe that the proposed FMVMTF method almost always achieves
better or comparable results compared with all other methods. The main reason is that the co-clustering technique clusters the samples and the features of multi-view data sets simultaneously. Although the proposed FMVBKM algorithm achieves clustering results similar to those of the compared approaches, it improves efficiency. In addition, the proposed methods adaptively learn the weight of each view and have no parameter to set. Further, for the AWA data set, which contains 50 classes with 80 samples per class, the performance of the proposed methods exceeds that of the other compared methods. This is not surprising given the widely accepted hypothesis that clustering of features is helpful to the clustering of samples. As can be seen, the FMVMTF algorithm always performs better than the FMVBKM algorithm, which demonstrates that the low-rank matrix decomposition becomes more accurate when the degrees of freedom are increased. Thus, compared with the state-of-the-art multi-view clustering algorithms, the proposed methods are more practical because no parameters need to be preset.

Table 5: Clustering performance comparison on Cal101-7 and Cal101-20 data sets
Data set   Metric  ContrainSC  CoregSC  MVSC    AMGL    MLAN    SwMC    RMVKM   FMVBKM  FMVMTF
Cal101-7   ACC     0.4410      0.4735   0.5550  0.6615  0.6581  0.6486  0.4593  0.7090  0.7368
Cal101-7   NMI     0.4166      0.3623   0.4304  0.5614  0.4448  0.5597  0.3166  0.4865  0.6686
Cal101-7   Purity  0.7901      0.7883   0.8358  0.8426  0.8304  0.8414  0.8141  0.8250  0.8521
Cal101-20  ACC     0.3558      0.4057   0.4954  0.5268  0.4489  0.4195  0.3043  0.5754  0.6957
Cal101-20  NMI     0.4762      0.5157   0.5152  0.5180  0.3695  0.3084  0.3520  0.4535  0.5785
Cal101-20  Purity  0.5024      0.4994   0.6894  0.6421  0.6169  0.5126  0.6081  0.6500  0.6895

Table 6: Clustering performance comparison on HW and NUS data sets
Data set   Metric  ContrainSC  CoregSC  MVSC    AMGL    MLAN    SwMC    RMVKM   FMVBKM  FMVMTF
HW         ACC     0.5865      0.8420   0.2945  0.8460  0.6830  0.8450  0.3825  0.6038  0.9405
HW         NMI     0.5816      0.7769   0.2208  0.8732  0.7255  0.8942  0.4484  0.6425  0.8754
HW         Purity  0.6275      0.8420   0.2985  0.8710  0.7275  0.8820  0.4250  0.6165  0.9405
NUS        ACC     0.2696      0.2583   0.2054  0.1988  0.1183  0.1333  0.2183  0.2529  0.2729
NUS        NMI     0.1379      0.1286   0.1088  0.0985  0.0395  0.0581  0.1203  0.1022  0.1296
NUS        Purity  0.2704      0.2719   0.2246  0.2163  0.1225  0.1437  0.2425  0.2529  0.2808

6.4. Studies of view weight
In the two proposed approaches, the view weights are inversely proportional to the value of the corresponding objective function. Therefore, during the solving process, a view weight rises as the corresponding objective function value decreases. To show that the view weights converge, we give the convergence curve of each view for all data sets. They are shown in Figure 1 and Figure 2, respectively.
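The inverse relation between a view's weight and its objective value can be illustrated with a small sketch. It is only an assumption-laden illustration: the function names, the proportionality constant `scale` and the guard `eps` are hypothetical, and the exact update rule is the one defined by the methods themselves, which is not reproduced here. The uniform 1/V starting point follows the initialization used in the experiments.

```python
def initial_view_weights(num_views):
    """Uniform start: each of the V views receives weight 1/V."""
    return [1.0 / num_views] * num_views


def update_view_weights(objectives, scale=0.5, eps=1e-12):
    """Re-weight each view inversely to its current objective value.

    `scale` and `eps` are hypothetical parameters chosen for illustration;
    `eps` only guards against division by zero.
    """
    return [scale / max(j, eps) for j in objectives]


# Toy run with made-up numbers: the per-view objective values shrink over
# three iterations, so the corresponding weights grow.
history = [[40.0, 25.0, 10.0], [30.0, 20.0, 8.0], [28.0, 19.0, 7.5]]
weights = [update_view_weights(objs) for objs in history]
```

In this toy run the view with the smallest objective value always carries the largest weight, and because the updated weights are not tied to the 1/V starting values, the first update can move them far from the initialization.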
Table 7: Clustering performance comparison on AWA and MNIST data sets
Data set   Metric  ContrainSC  CoregSC  MVSC    AMGL    MLAN    SwMC    RMVKM   FMVBKM  FMVMTF
AWA        ACC     0.1218      0.1225   0.1133  0.0488  0.0573  0.0283  0.1195  0.1053  0.1258
AWA        NMI     0.1970      0.2005   0.1702  0.0535  0.0720  0.0334  0.1852  0.1520  0.2015
AWA        Purity  0.1278      0.1335   0.1235  0.0580  0.0648  0.0393  0.1285  0.1075  0.1383
MNIST      ACC     0.5695      0.5461   0.5555  0.7662  0.5564  0.7581  0.5462  0.5787  0.8023
MNIST      NMI     0.5326      0.4868   0.5889  0.6664  0.5978  0.7099  0.4835  0.4480  0.6586
MNIST      Purity  0.6182      0.5601   0.6146  0.7662  0.6089  0.7581  0.5783  0.5950  0.8023

Table 8: The mutual information matrix about Cal101-7 data set
Cal101-7   #v1     #v2     #v3     #v4     #v5     #v6
#v1        1       0.6124  0.3090  0.1235  0.2202  0.1470
#v2        0.6124  1       0.3308  0.1273  0.2375  0.1632
#v3        0.3090  0.3308  1       0.2788  0.4814  0.3371
#v4        0.1235  0.1273  0.2788  1       0.4431  0.5452
#v5        0.2202  0.2375  0.4814  0.4431  1       0.5504
#v6        0.1470  0.1632  0.3371  0.5452  0.5504  1
Figure 1 shows the view weights of the FMVBKM algorithm in each iteration. For each data set, the initial view weights are plotted, except for the NUS data set, where the updated weights are far from the initial weights. In the experiments, if the data set contains V views, the initial weights are set to 1/V. For the Cal101-7, Cal101-20, AWA and MNIST data sets, we can see that the view weights first decrease because the weights after the first iteration are far from the corresponding initial values. Then, the view weights rise as the relevant objective function value decreases. When the objective value reaches its minimum, the view weights reach the optimal solution. Figure 2 presents the view weights of the FMVMTF algorithm. The initial view weights are the same as for the FMVBKM algorithm. Because the initial values and the updated weight values differ enormously, the weight curves are shown without the initial values. The trend in Figure 2 is similar to that in Figure 1, so we do not explain it in detail here.

Figure 1: Convergence curve of the proposed FMVBKM approach about view weights on six benchmark data sets. (a) Cal101-7 (b) Cal101-20 (c) HW (d) NUS (e) AWA (f) MNIST.

Figure 2: Convergence curve of the proposed FMVMTF approach about view weights on six benchmark data sets. (a) Cal101-7 (b) Cal101-20 (c) HW (d) NUS (e) AWA (f) MNIST.

To show the relationships among views, following [43], the redundancy can be measured by their mutual information. Firstly, each sample is regarded as a variable in every view. Then, the mutual information between any two views is computed for each sample. Finally, the mean mutual information over all samples is used to measure the relationship between two views. We select four data sets to measure the mutual information among views; the results for the Cal101-7, Cal101-20, HW and MNIST data sets are shown in Table 8 to Table 11, respectively. Taking the Cal101-7 and Cal101-20 data sets as examples, we can see that the mutual information between the first and the second view is larger than that between any other two views. This shows that the two views are strongly related and yield consistent clustering performance. Similarly, from Figure 1(a)-(b) and Figure 2(a)-(b), we can also see that the first and second views are assigned similar view weights. Moreover, observing the mutual information for the MNIST data set in Table 11, the first and third views are strongly related. Similarly, from Figure 1(f) and Figure 2(f), we can also see that the first and third views have similar importance for the clustering result. In addition, the second view is assigned a higher view weight than the other two views for the MNIST data set. Further, an experiment using only the second view with the proposed method was carried out, but it is not shown due to space limitations. According to that experimental result, the second view plays an important role in clustering. On the whole, the larger the mutual information between two views, the stronger their relationship, and therefore the more similar their importance for the clustering results.

Table 9: The mutual information matrix about Cal101-20 data set
Cal101-20  #v1     #v2     #v3     #v4     #v5     #v6
#v1        1       0.6202  0.3094  0.1234  0.2185  0.1470
#v2        0.6202  1       0.3346  0.1279  0.2372  0.1643
#v3        0.3094  0.3346  1       0.2821  0.4813  0.3394
#v4        0.1234  0.1279  0.2821  1       0.4445  0.5481
#v5        0.2185  0.2372  0.4813  0.4445  1       0.5486
#v6        0.1470  0.1643  0.3394  0.5481  0.5486  1

Table 10: The mutual information matrix about HW data set
HW         #v1     #v2     #v3     #v4     #v5     #v6
#v1        1       0.4407  0.6176  0.2060  0.5218  0.2518
#v2        0.4407  1       0.4110  0.2356  0.3318  0.1709
#v3        0.6176  0.4110  1       0.1973  0.5535  0.2731
#v4        0.2060  0.2356  0.1973  1       0.1673  0.0716
#v5        0.5218  0.3318  0.5535  0.1673  1       0.2244
#v6        0.2518  0.1709  0.2731  0.0716  0.2244  1

Table 11: The mutual information matrix about MNIST data set
MNIST      #v1     #v2     #v3
#v1        1       0.5185  0.7282
#v2        0.5185  1       0.5144
#v3        0.7282  0.5144  1

6.5. Selection of the feature clustering number
In all co-clustering methods, the numbers of sample clusters and feature clusters are generally set to be equal. Since the number of clusters is smaller than the numbers of samples and features, the two proposed methods can be regarded as low-rank approximations. According to low-rank theory, the ranks of the indicator matrices F and G are less than or equal to the number of clusters. Moreover, it is clear that the maximum rank of the indicator matrix G depends on the number of feature clusters. To show how the rank of the approximation affects the co-clustering accuracy, we set different numbers of feature clusters. The co-clustering results are shown in Figure 3.
From Figure 3, for the HW, NUS, AWA and MNIST data sets, the clustering performance improves as the number of feature clusters increases. For the Cal101-7 and Cal101-20 data sets, although the results fluctuate a little, the overall trend is upward. Based on these results, we can conclude that the co-clustering accuracy is influenced by the maximum rank associated with the features. This phenomenon appears because more feature clusters preserve more useful information.

Figure 3: Clustering performance of the proposed FMVMTF approach about different feature clustering numbers on six benchmark data sets. (a) Cal101-7 (b) Cal101-20 (c) HW (d) NUS (e) AWA (f) MNIST. Each panel plots ACC, NMI and Purity (measure value in %) against the clustering number about features.

6.6. Studies of Computational Complexity
Table 12 shows the running time of all methods on the six benchmark data sets. We run each experiment 10 times and record the average running time of every method on every data set. It is clear that the proposed FMVBKM algorithm is faster than the other compared methods in almost all cases. In particular, for the MNIST data set, because the proposed FMVBKM algorithm needs fewer matrix multiplications in each iteration, the clustering process is sped up considerably. However, for the AWA data set, the two proposed methods are not fast because the time cost is proportional to the number of clusters. Although the proposed FMVMTF algorithm is a little slower than the FMVBKM algorithm, its clustering results are much better. Moreover, Figure 4 shows the convergence curves of the proposed FMVMTF algorithm on all data sets. We can observe that the proposed FMVMTF algorithm converges within a few iteration steps [40]. Therefore, the proposed FMVMTF algorithm not only gives effective clustering results but also has low computational complexity. The reason is that, whereas the computational complexity of matrix factorization is O(n^3), the proposed FMVMTF algorithm has high computational efficiency owing to the introduction of indicator matrices. The effective clustering results are based on two factors. The first is the increased degrees of freedom, which connect the sample indicator matrix with the feature indicator matrix. The second is that the row clustering induces the column clustering while the column clustering induces the row clustering.

Table 12: Run time (in seconds) comparison of different approaches on six multi-view data sets. (The "−" denotes that the time cost exceeds one hour.)
Data set   ContrainSC  CoregSC  MVSC    AMGL    MLAN    SwMC    RMVKM   FMVMTF  FMVBKM
Cal101-7   26.37       13.18    17.71   23.14   6.20    79.61   16.08   7.68    5.93
Cal101-20  101.58      42.44    63.92   85.62   27.48   314.28  42.61   29.20   27.20
HW         55.61       24.82    42.81   65.11   7.59    235.74  15.94   5.00    4.95
NUS        88.32       38.11    73.03   170.66  19.19   106.98  23.97   7.89    7.73
AWA        411.11      197.71   317.88  136.31  200.52  240.12  240.09  297.97  283.86
MNIST 1978.24 839.39 29.32 864.09 239.85 119.57 9.29 7.79

Figure 4: Convergence curve of the proposed FMVMTF method on six benchmark data sets. (a) Cal101-7 (b) Cal101-20 (c) HW (d) NUS (e) AWA (f) MNIST. Each panel plots the objective function values against the number of iterations.

7. Conclusion
In this work, to simultaneously cluster both the set of samples and the set of features, we first propose a Fast Multi-view Bilateral K-means (FMVBKM) method. It adopts the main idea of BKM and adaptively controls the coordination among multiple views in a re-weighted manner. It can be readily seen that the FMVBKM method is actually a special matrix decomposition problem. It can also be regarded as a low-rank matrix approximation problem with indicator-matrix and diagonal-matrix constraints. These constraints mean that the solution involves fewer multiplications. Therefore, the proposed method has high computational efficiency. However, the diagonal constraint in the matrix decomposition leads to a rather poor low-rank approximation. To reduce the information loss, we relax the FMVBKM model by providing increased degrees of freedom, which keeps the low-rank approximation accurate; the resulting approach is named Fast Multi-view Matrix Tri-Factorization (FMVMTF). The proposed methods are evaluated on six benchmark data sets, and the experimental results show their effectiveness. Further, the running-time experiments on multi-view data sets show that the proposed FMVBKM method has slightly higher computational efficiency than the FMVMTF approach. From the experimental results, we can conclude that the FMVMTF method mainly improves the clustering results and the FMVBKM approach primarily enhances the computational efficiency. In the future, we will design a multi-view clustering method which is not only effective but also efficient.
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under grant numbers 61772427 and 61751202.
References
[1] A. Y. Ng, M. I. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, in: Advances in Neural Information Processing Systems, 2002, pp. 849–856.
[2] A. K. Jain, Data clustering: 50 years beyond k-means, Pattern Recognition Letters 31 (8) (2010) 651–666.
[3] P. Berkhin, A survey of clustering data mining techniques, in: Grouping multidimensional data, Springer, 2006, pp. 25–71.
[4] J. Newling, F. Fleuret, Fast k-means with accurate bounds, in: International Conference on Machine Learning, 2016, pp. 936–944.
[5] Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, T. Mytkowicz, Yinyang k-means: A drop-in replacement of the classic k-means with consistent speedup, in: International Conference on Machine Learning, 2015, pp. 579–587.
[6] F. Nie, X. Wang, C. Deng, H. Huang, Learning a structured optimal bipartite graph for co-clustering, in: Advances in Neural Information Processing Systems, 2017, pp. 4129–4138.
[7] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1, IEEE, 2005, pp. 886–893.
[8] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7) (2002) 971–987.
[9] J. Wu, J. M. Rehg, Where am i: Place instance and category recognition using spatial pact, in: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, 2008, pp. 1–8.
[10] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[11] A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, International Journal of Computer Vision 42 (3) (2001) 145–175.
[12] J. Liu, C. Wang, J. Gao, J. Han, Multi-view clustering via joint nonnegative matrix factorization, in: Proceedings of the 2013 SIAM International Conference on Data Mining, SIAM, 2013, pp. 252–260.
[13] Z. Zhang, L. Liu, F. Shen, H. T. Shen, L. Shao, Binary multi-view clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (7) (2018) 1774–1782.
[14] K. Livescu, K. Sridharan, S. Kakade, K. Chaudhuri, Multi-view clustering via canonical correlation analysis, in: Neural Information Processing Systems Conference, 2008.
[15] A. Kumar, P. Rai, H. Daumé, Co-regularized multi-view spectral clustering, in: Advances in Neural Information Processing Systems, 2011, pp. 1413–1421.
[16] A. Kumar, H. Daumé, A co-training approach for multi-view spectral clustering, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 393–400.
[17] F. Nie, G. Cai, J. Li, X. Li, Auto-weighted multi-view learning for image clustering and semi-supervised classification, IEEE Transactions on Image Processing 27 (3) (2018) 1501–1511.
[18] F. Nie, J. Li, X. Li, et al., Parameter-free auto-weighted multiple graph learning: A framework for multiview clustering and semi-supervised classification, in: IJCAI, 2016, pp. 1881–1887.
[19] F. Nie, J. Li, X. Li, Self-weighted multiview clustering with multiple graphs, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 2564–2570.
[20] P. Ren, Y. Xiao, P. Xu, J. Guo, X. Chen, X. Wang, D. Fang, Robust auto-weighted multi-view clustering, in: IJCAI, 2018, pp. 2644–2650.
[21] S. Huang, Z. Kang, I. W. Tsang, Z. Xu, Auto-weighted multi-view clustering via kernelized graph learning, Pattern Recognition 88 (2019) 174–184.
[22] C. Ding, X. He, H. D. Simon, Nonnegative lagrangian relaxation of k-means and spectral clustering, in: European Conference on Machine Learning, Springer, 2005, pp. 530–538.
[23] X. Cai, F. Nie, H. Huang, Multi-view k-means clustering on big data, in: IJCAI, 2013, pp. 2598–2604.
[24] S. Huang, Y. Ren, Z. Xu, Robust multi-view data clustering with multi-view capped-norm k-means, Neurocomputing 311 (2018) 197–208.
[25] A. Rakotomamonjy, S. Chanda, Lp-norm multiple kernel learning with low-rank kernels, Neurocomputing 143 (2014) 68–79.
[26] C. Guan, Y. Fu, X. Lu, E. Chen, X. Li, H. Xiong, Efficient karaoke song recommendation via multiple kernel learning approximation, Neurocomputing 254 (2017) 22–32.
[27] Y. Xu, C. Wang, J. Lai, Weighted multi-view clustering with feature selection, Pattern Recognition 53 (2016) 25–35.
[28] S. Huang, Z. Xu, I. W. Tsang, Z. Kang, Auto-weighted multi-view co-clustering with bipartite graphs, Information Sciences 512 (2020) 18–30.
[29] S. Huang, Z. Xu, J. Lv, Adaptive local structure learning for document co-clustering, Knowledge-Based Systems 148 (2018) 74–84.
[30] I. S. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2001, pp. 269–274.
[31] B. Long, X. Wu, Z. M. Zhang, P. S. Yu, Unsupervised learning on k-partite graphs, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, pp. 317–326.
[32] H. Wang, F. Nie, H. Huang, F. Makedon, Fast nonnegative matrix tri-factorization for large-scale data co-clustering, in: IJCAI Proceedings-International Joint Conference on Artificial Intelligence, Vol. 22, 2011, p. 1553.
[33] J. Han, K. Song, F. Nie, X. Li, Bilateral k-means algorithm for fast co-clustering, in: AAAI, 2017, pp. 1969–1975.
[34] S. X. Yu, J. Shi, Multiclass spectral clustering, in: Proceedings of the Ninth IEEE International Conference on Computer Vision, IEEE, 2003, p. 313.
[35] F. Nie, L. Tian, X. Li, Multiview clustering via adaptively weighted procrustes, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2018, pp. 2022–2030.
[36] F. Nie, H. Huang, X. Cai, C. H. Ding, Efficient and robust feature selection via joint ℓ2,1-norms minimization, in: Advances in Neural Information Processing Systems, 2010, pp. 1813–1821.
[37] D. Cai, X. He, J. Han, Document clustering using locality preserving indexing, IEEE Transactions on Knowledge and Data Engineering 17 (12) (2005) 1624–1637.
[38] R. Varshavsky, M. Linial, D. Horn, Compact: A comparative package for clustering assessment, in: International Symposium on Parallel and Distributed Processing and Applications, Springer, 2005, pp. 159–167.
[39] D. Dueck, B. J. Frey, Non-metric affinity propagation for unsupervised image categorization, in: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, IEEE, 2007, pp. 1–8.
[40] J. Xu, J. Han, F. Nie, X. Li, Re-weighted discriminatively embedded k-means for multi-view clustering, IEEE Transactions on Image Processing 26 (6) (2017) 3016–3027.
[41] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, Nus-wide: a real-world web image database from national university of singapore, in: Proceedings of the ACM International Conference on Image and Video Retrieval, ACM, 2009, p. 48.
[42] Y. Li, F. Nie, H. Huang, J. Huang, Large-scale multi-view spectral clustering via bipartite graph, in: AAAI, 2015, pp. 2750–2756.
[43] D. Niu, J. G. Dy, M. I. Jordan, Multiple non-redundant spectral clustering views, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 831–838.
Author Biography
Feiping Nie received the Ph.D. degree in Computer Science from Tsinghua University, China, in 2009, and is currently a full professor at Northwestern Polytechnical University, China. His research interests are machine learning and its applications, such as pattern recognition, data mining, computer vision, image processing and information retrieval. He has published more than 100 papers in the following journals and conferences: TPAMI, IJCV, TIP, TNNLS, TKDE, ICML, NIPS, KDD, IJCAI, AAAI, ICCV, CVPR, ACM MM. His papers have been cited more than 10000 times and his H-index is 57. He is now serving as Associate Editor or PC member for several prestigious journals and conferences in the related fields.
Shaojun Shi is now working toward her PhD degree in the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an, 710072, Shaanxi, China. Her research interests include topics in data mining and machine learning.
Xuelong Li is a full professor with the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an 710072, P.R. China. He is a fellow of the IEEE.
Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: