Auto-weighted multi-view co-clustering via fast matrix factorization

Auto-weighted multi-view co-clustering via fast matrix factorization

Auto-weighted Multi-view Co-clustering via Fast Matrix Factorization Journal Pre-proof Auto-weighted Multi-view Co-clustering via Fast Matrix Factor...

751KB Sizes 0 Downloads 95 Views

Auto-weighted Multi-view Co-clustering via Fast Matrix Factorization

Journal Pre-proof

Auto-weighted Multi-view Co-clustering via Fast Matrix Factorization Feiping Nie, Shaojun Shi, Xuelong Li PII: DOI: Reference:

S0031-3203(20)30013-3 https://doi.org/10.1016/j.patcog.2020.107207 PR 107207

To appear in:

Pattern Recognition

Received date: Revised date: Accepted date:

16 January 2019 28 November 2019 16 January 2020

Please cite this article as: Feiping Nie, Shaojun Shi, Xuelong Li, Auto-weighted Multiview Co-clustering via Fast Matrix Factorization, Pattern Recognition (2020), doi: https://doi.org/10.1016/j.patcog.2020.107207

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier Ltd.

Highlights • Distinguishing the existing multi-view clustering methods, the proposed approaches involve constraints of indicator matrix in matrix decomposition. Due to the ex-

istence of constraints, we can directly acquire the clustering results of samples and features. Thus, the proposed methods are highly efficient for the clustering problem of solving multi-view data sets. • According to the importance of each view for the clustering task, the proposed

approaches automatically learn the weight factor in a re-weighted manner. Moreover, the proposed methods are free parameter which make them be more practical.

• Comparing with graph-based multi-view clustering algorithms, the computa-

tional complexity of the proposed methods is same as the traditional K-means algorithm due to the fact that they do not need eigenvalue decomposition in solving process which is heavy computation burden for multi-view data sets.

1

Auto-weighted Multi-view Co-clustering via Fast Matrix Factorization Feiping Niea,∗, Shaojun Shia , Xuelong Lia a School

of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xian 710072, Shaanxi, P. R. China.

Abstract Multi-view clustering is a hot research topic in machine learning and pattern recognition, however, it remains high computational complexity when clustering multi-view data sets. Although a number of approaches have been proposed to accelerate the computational efficiency, most of them do not consider the data duality between features and samples. In this paper, we propose a novel co-clustering approach termed as Fast Multi-view Bilateral K-means (FMVBKM), which can implement clustering task on row and column of the input data matrix, simultaneously. Specifically, FMVBKM applies the relaxed K-means clustering technique to multi-view data clustering. In addition, to decrease information loss in matrix factorization, we further introduce a new co-clustering method named as Fast Multi-view Matrix Tri-Factorization (FMVMTF). Extensive experimental results on six benchmark data sets show that the proposed two approaches not only have comparable clustering performance but also present the high computational efficiency, in comparison with state-of-the-art multi-view clustering methods. Keywords: Co-clustering, multi-view data, matrix factorization, auto-weighted.

1. Introduction Data clustering is a fundamental technique which is widely used in computer vision [1], pattern recognition [2] and data mining communities [3]. The goal of cluster∗ Corresponding

author. Email address: [email protected]. (Feiping Nie)

Preprint submitted to Pattern Recognition

January 21, 2020

ing is to assign data of similar patterns into the same cluster while categorize the data 5

of dissimilar structures into different groups. Traditional clustering algorithms such as K-means and Spectral Clustering, have been researched in the past years. With the rapid development of information technology and the rising of data sharing web sites, the data sets exist heterogeneous features which are collected from different sources. Therefore, clustering these multi-view data becomes a challenging prob-

10

lem. Although spectral clustering and variants have gained good performance, they are not adaptive to large-scale multi-view data since they need to construct an adjacency matrix which has high computational complexity. Recently, K-means clustering (KM) algorithm attracts a lot of attention due to its simplicity, effectiveness [4] and mathematical tractability [5].

15

In many applications, there are many data sets which exist duality. For instance, in document data, the duality represents that clustering documents is based on the relationships of different word clusters, while clustering words is on the basis of the connection with distinct document clusters. Traditional clustering methods cluster samples according to the distribution of features, conversely, clustering features is based on the

20

distribution of samples. Thus, they only consider the one-sided clustering technique and ignore the duality information [6]. In addition, the data sets are often described by multiple views, therefore we focus on multi-view clustering by exploring the heterogeneous features like HOG [7], LBP [8], CENT [9], SIFT [10], GIST [11] and so on. For multi-view data sets, each view has its own specific properties which can result in

25

better clustering performance by effective combination about heterogeneous features. One conventional multi-view clustering algorithm directly concatenates these features from different views into a long feature, and then clusters the long feature by applying clustering approaches of single view. However, the direct accumulation may give rise to over-fitting since the feature dimension highly exceeds the number of samples. Ac-

30

cording to recently proposed multi-view clustering methods [12], they can be roughly divided into three categories: 1) Common feature subspace; 2) Multi-view spectral clustering (MVSC); 3) Multi-view K-means (MVKM) [13]. The first category of methods endeavor to focus on searching for a common lowdimensional feature subspace [14], and then project different views into the subspace 3

35

where we can conduct the subsequent clustering task. The second kind of approaches apply spectral clustering to the multi-view setting. It is termed as multi-view spectral clustering method which exploits the multi-view data geometrical structure by constructing a similarity matrix and clusters the samples by using existing clustering methods. For example, in reference [15], the co-regularized

40

clustering scheme for multi-view spectral clustering makes different graphs agree with each other. The method in [16] is based on co-training that it agrees across the views to search for the clusters for multi-view data. In the work [17], an auto-weighted multiview learning method is proposed which can thoroughly learn the relationship and complex structures of data according to the constructed graph. It can learn the similarity

45

graph and divide the graph into specific clusters, simultaneously. In [18], Nie et al. reformulated the standard spectral clustering algorithm as well as applied it to the multiview clustering and semi-supervised classification task. In the work [19], in order to cluster multi-view data while learn the weight for each view, Nie et al. proposed a selfweighted multi-view clustering method which is based on exploring a Laplacian rank

50

constrained graph. Due to the fact that data exhibits considerable noise, Ren et al. [20] proposed a robust auto-weighted multi-view clustering approach which can make the learned graph approximate the original graph of each view and employ the l2,1 norm to keep robustness. Moreover, considering the nonlinear relationships in real-world data sets, Huang et al. [21] proposed an auto-weighted Multi-view Clustering with Multiple

55

Kernels (MVCMK) algorithm. Although the graph-based multi-view clustering methods have achieved power capability for clustering multi-view data which can explore and fully utilize advantages of the complementary information, there are high computational cost in constructing multiple similarity graphs. Therefore, for large-scale multi-view data sets, multi-view spectral clustering approaches are not adaptive.

60

The third type of algorithms are based on matrix factorization, which are equivalent to the relaxed K-means [22]. The approaches can improve computational efficiency since they do not need build similarity graphs. To keep consistent better clustering performance, Cai et al. [23] proposed a Robust Large-scale Multi-view K-means Clustering (RMVKM) method which can decrease time cost since it can avoid the process

65

of constructing the similarity graphs as well as eigenvalue decomposition in solving 4

the corresponding objective function. To keep robust to noise and outliers, Huang [24] proposed a robust Capped-Norm Multi-view Clustering (CaMVC) approach. In matrix factorization, the low rank approximation is the representative one. Especially, the kernel-based low rank approximation, it is widely used in the different fields such 70

as multi-view clustering, classification and regression and so on. For example, Rakotomamonjy et al. [25] used the low-rank approximation based on multiple kernels to address the problem of large-scale learning. To eliminate users’ vocal rating, Guan et al. [26] proposed a sample compression method via multiple kernel learning approximation. Inspired by the kernel-based multi-view clustering, Xu et al. [27] proposed

75

a Weighted Multi-view Clustering with Feature Selection (WMCFS) algorithm which can simultaneously perform multi-view clustering and feature selection. What is more, in order to take full advantage of the duality in which samples can induce feature clustering while features can also induce sample clustering, some researchers propose co-clustering algorithms [28, 29] which cluster the samples and fea-

80

tures, simultaneously. Co-clustering algorithms are suitable for mutual dependent data sets between row clustering and column clustering as well as sparse data sets. Coclustering which is also called bi-clustering, mainly contains two models, i.e. checkerboard co-cluster structure and diagonal structure. On the one hand, to co-cluster words and documents, the Bipartite Spectral Graph Partition (BSGP) proposed by Dhillon et

85

al. [30] is the most famous one in diagonal co-clustering models. Nevertheless, this kind of method damages clustering performance for multipartitioning problem and is prohibitive for clustering large-scale data due to the existence of singular value decomposition in solving process. Thus, many methods [31] for solving multipartition problem are proposed. On the other hand, Fast Nonnegative Matrix Tri-factorization (FN-

90

MTF) algorithm [32] is the representative approach for the checkerboard co-clustering structure. Due to the fact that the matrix decomposition exists uncertainty, Han et al. [33] proposed a special non-negative matrix decomposition (NMF) algorithm which is termed as Bilateral K-means (BKM) by solving the relaxed minimum normalized cut problem. However, the above mentioned algorithms are designed for single-view data

95

clustering. For multi-view data sets, if we directly extend the single-view clustering algorithm into multi-view data, which is not reasonable since each of multi-view data 5

has different importance for the task of clustering. In this paper, the proposed multi-view clustering approaches are based on the matrix factorization. The highlights and main contributions of this paper are summarized 100

as follows: 1) Distinguishing the existing multi-view clustering methods, the proposed approaches involve constraints of indicator matrix in matrix decomposition. Due to the existence of constraints, we can directly acquire the clustering results of samples and features. Thus, the proposed methods are highly efficient for the clustering problem of

105

solving multi-view data sets. 2) According to the importance of each view for the clustering task, the proposed approaches automatically learn the weight factor in a re-weighted manner. Moreover, the proposed methods are free parameter which make them be more practical. 3) Comparing with graph-based multi-view clustering algorithms, the computa-

110

tional complexity of the proposed methods is same as the traditional K-means algorithm due to the fact that they do not need eigenvalue decomposition in solving process which is heavy computation burden for multi-view data sets. The rest of this paper is organized as follows: In section 2, we will review the classical co-clustering algorithms. In section 3, the proposed FMVBKM method will

115

be given, elaborately. Section 4 presents a tractable and skillful optimization method. The proposed FMVMTF approach and solving process are showed in Section 5. The clustering results as well as analysis on several data sets will be showed in section 6. Finally, the conclusion is presented in section 7. Notations:

120

Before continuing, we will briefly introduce the notations used in this paper. For a data matrix X ∈ Rn×d , where n is the number of samples and d represents the

dimension of features. we denote its (i, j)-th entry by xij . For multi-view data sets, let X v ∈ Rn×mv or Xv ∈ Rn×mv denote the v-th view of data matrix, where mv

is the v-th view dimension of features, the (i, j)-th element can be represented by 125

v Xij , the i-th row and j-th column are denoted as xvi· and xv·j , respectively. Φ denotes

the indicator matrix set which stores the clustering results of samples and features.

Each row of indicator matrices has one and only one element equal to 1 and remaining 6

elements are 0. For the v-th view data matrix Xv , we denote its Frobenius norm by 1

||Xv ||F = (T r(Xv (Xv )T )) 2 , where (Xv )T is the transpose of Xv and T r represents 130

the trace of matrix. In addition, the extracting diagonal elements of matrix Sv can be represented by vector sv . 2. Related work In this section, we first revisit the Bipartite Spectral Graph Partitioning (BSGP) approach which is greatly effective for co-clustering, and then review the Bilateral

135

K-means (BKM) algorithm. Finally, to cluster the multi-view data sets, the bilateral clustering technique is applied to the multi-view clustering case. 2.1. Revisit BSGP Let X ∈ Rn×d be a data matrix. In order to make sample clustering induce feature

clustering while feature clustering induce sample clustering [30], a bipartite graph G = (V, A) is constructed between the samples and features. Moreover, the bipartite graph G can also be seen as an undirected weighted graph, where the V is the node set of size (n + d). The A denotes an adjacency matrix as Eq.(1) 

 X  0

0

A= XT

(1)

The purpose of BSGP is to find the minimal normalized cut on the graph G. It can be defined by solving the following problem: min y

y T Ly y T Dy

(2) (n+d)×1

s.t.y ∈ {−1, 1}

140

.

where L = D − A ∈ R(n+d)×(n+d) is a Laplacian matrix, D ∈ R(n+d)×(n+d) repP resents a diagonal degree matrix with dii = j aij . y = [pT , q T ]T , p ∈ {−1, 1}n×1

denotes the clustering results of samples. q ∈ {−1, 1}d×1 represents the clustering results of features. pi = 1 and qi = 1 denote that the i-th sample and feature belong to the first cluster, otherwise, they belong the second cluster, respectively. 7

2.2. Revisit BKM Suppose the graph G includes c components. We can obtain the following opti145

mization problem by applying the multipartitioning normalized cuts (Ncuts) to BSGP [33]:

min Y

c X ykT Lyk y T Dyk k=1 k

(3)

s.t.Y ∈ Φ(n+d)×c . where yk = [pTk , qkT ]T . When pik = 1, the i-th sample belongs to k-th cluster. When qik = 1, it denotes that the i-th feature belongs to k-th cluster. Due to the fact that ykT Dyk is the (k, k)-th diagonal element in matrix Y T DY . Therefore, the problem(3) can be written as follows: min T r(Y T LY (Y T DY )−1 ) Y

(4)

s.t.Y ∈ Φ(n+d)×c . where the matrix trace can be represented by T r(·) . Suppose I is an identity matrix, L = D − A and Y T = [P T , QT ], where P and Q

are the clustering results of samples and features, respectively. Substituting L into the Eq.(4), it can be rewritten as min T r(Y T LY (Y T DY )−1 ) Y

= min T r(Y T (D − A)Y (Y T DY )−1 ) Y

(5)

= min T r(I − Y T AY (Y T DY )−1 ). Y

According to the definition of A and Y , the above problem can be further rewritten as min T r(−2P T XQ(Y T DY )−1 ) P,Q

(6)

s.t.P ∈ Φn×c , Q ∈ Φd×c .

8

Since the above problem is NP-complete [34], we can relax the above problem into a matrix decomposition problem by adding two terms T r((Y T DY )−1 P T P (Y T DY )−1 QT Q) and T r(X T X). It is min T r(−2P T XQ(Y T DY )−1 ) + T r(X T X)+ P,Q

T r((Y T DY )−1 P T P (Y T DY )−1 QT Q)

(7)

= min ||X − P (Y T DY )−1 QT ||2F P,Q

s.t.P ∈ Φn×c , Q ∈ Φd×c . Note that the matrix (Y T DY )−1 is diagonal, hence, the BKM algorithm can be represented as follows: min ||X − P SQT ||2F

P,Q,S

s.t.P ∈ Φ

n×c

, S ∈ Diag, Q ∈ Φ

(8) m×c

.

where Diag is a diagonal matrix set and matrix S provides degrees of freedom between 150

P and Q. 2.3. Extend to Multi-view Clustering For multi-view data sets, suppose data matrix Xv ∈ Rn×mv denotes the v-th view

features. F ∈ Rn×c is an indicator matrix which stores the clustering results about

samples. Let indicator matrix Gv ∈ Rmv ×c represent the v-th view clustering results about features. To solve the multi-view clustering problem, it can be represented as follows: min

F,Sv ,Gv

M X v=1

||Xv − F Sv GTv ||2F

(9)

s.t.F ∈ Φn×c , Sv ∈ Diag, Gv ∈ Φmv ×c . where Sv is a diagonal matrix. M represents the view number. In next section, we will show that the direct multi-view extension is not optimal. Therefore, we will propose a new more reasonable approach. 9

155

3. The Proposed Fast Multi-view Bilateral K-means Method In this section, we first analyze the deficiencies of applying BKM algorithm to multi-view clustering, directly. This is the motivation of our auto-weighted scheme. Then, we propose a novel adaptively weighted BKM technique for multi-view clustering.

160

3.1. Motivation To the best of our knowledge that one traditional multi-view clustering method concatenates all features together as a single vector in order to use all views of features. However, it is not optimal because the important view of features and less important view of features have equal weight [23]. We can clearly see that Eq.(9) ignores the capacity difference of different views. It is ideal to unify each view of features based on the importance to the clustering task. Therefore, a trivial idea is that we substitute the above problem by using the following problem: M X 1 ||Xv − F Sv GTv ||2F F,Sv ,Gv α v v=1

min

(10)

s.t.F ∈ Φn×c , Sv ∈ Diag, Gv ∈ Φmv ×c , where the constant αv measures the clustering capacity for each view individually. It is αv , min ||Xv − F Sv GTv ||F .

(11)

F,Sv ,Gv

3.2. Adaptively Weighted Bilateral K-means From the above problem, we can see that the αv measures the v-th view clustering capacity and forms a weighted BKM algorithm. Moreover, for each view, the more high clustering capacity, the more large weight, and vice versa. However, it is not reasonable when the v-th view has significant disagreement on Xv comparing with the other views. For example, the v-th view does not desire a large weight

1 αv

when the

αv is small [35]. To solve the problem, we can use a unified evaluation instead of the

10

individually evaluating clustering capacities of views. It can be described as follows: M X 1 ||Xv − F Sv GTv ||2F F,Sv ,Gv α v=1 v

F ∗ , Sv∗ , G∗v =arg min

(12)

s.t.F ∈ Φn×c , Sv ∈ Diag, Gv ∈ Φmv ×c , where αv , ||Xv − F ∗ Sv∗ G∗v T ||F .

(13)

It is clear that the above problem is equivalent to the following problem. It is

min

F,Sv ,Gv

M X v=1

s.t.F ∈ Φ

||Xv − F Sv GTv ||F

n×c

, Sv ∈ Diag, Gv ∈ Φ

(14) mv ×c

.

Which is the proposed multi-view clustering method termed as Fast Multi-view Bilateral K-means (FMVBKM) algorithm. 4. Optimization Algorithm of Solving Problem(14) 165

Before giving the solution of FMVBKM algorithm, we present an important proposition as follows. Definition1. Suppose B ∈ Rk×k , its (i, i)-element is bii , a function f (B) is de-

fined as f (B) = [b11 , b22 , . . . , bkk ]T = b.

Proposition1. If C ∈ Rk×k is a diagonal matrix, for an arbitrary matrix B ∈

170

Rk×k , there is T r(BC) = f (B)T f (C).

Proof: Since C is a diagonal matrix, we can get the following equation. It is Pk T r(BC) = i=1 bii cii = [b11 , b22 , . . . , bkk ][c11 , c22 , . . . , ckk ]T = f (B)T f (C). The difficulty of solving Eq.(14) is from that each row vector of indicator matrices

F and Gv must satisfy 1-of-c coding scheme. In addition, we can see that the Eq.(14) is equal to the following problem(15) which not only learns the weights adaptively, but also takes full advantage of the relationships of multi-view data. It can be represented

11

as follows: min

F,Gv ,Sv

M X v=1

s.t.F ∈ Φ where dv =

1 2||Xv −F Sv GT v ||F

dv ||Xv − F Sv GTv ||2F

n×c

, Sv ∈ Diag, Gv ∈ Φ

(15) mv ×c

.

is the weight for the v-th view.

In order to derive the Eq.(15), we can rewrite it as

min

F,Gv ,Sv ,dv

M X v=1

dv T r{(Xv − F Sv GTv )T (Xv − F Sv GTv )}

(16)

s.t.F ∈ Φn×c , Sv ∈ Diag, Gv ∈ Φmv ×c . Further, we separate the above problem into several subproblems and adopt the alter175

nating iterative strategy to solve them. 1)Step1. Solving Sv , when F , Gv and dv are fixed. Using the properties of matrix trace, the objective function of Eq.(16) can be rewritten as: min

Sv ∈Diag

= min

Sv ∈Diag

M X v=1

M X v=1

dv T r{(Xv − F Sv GTv )T (Xv − F Sv GTv )} {dv {T r(XvT Xv ) − 2T r(GTv XvT F Sv )+

(17)

T r(SvT F T F Sv GTv Gv )}}. Since F and Gv are indicator matrices, F T F and GTv Gv are diagonal matrices. According to Proposition 1, T r(SvT F T F Sv GTv Gv ) =T r(SvT F T F GTv Gv Sv )

(18)

=(f (SvT ))T F T F GTv Gv f (Sv ). Similarly, T r(GTv XvT F Sv ) = (f (GTv XvT F ))T f (Sv ). Let Wv denote F T F GTv Gv .

12

sv represents f (Sv ) and hv denotes f (GTv XvT F ). Therefore, the objective function Eq.(17) can be rewritten as

J = min sv

M X v=1

dv {T r(XvT Xv ) − 2hTv sv + sTv Wv sv }.

(19)

The problem can be transformed to solve sv for solving Sv . Let us take the derivative of Eq.(19) with respect to sv and set the derivative as 0, we have ∂J = 2(Wv sv − hv ) = 0. ∂sv

(20)

The solution of problem (17) can be represented as follows: sv = Wv−1 hv .

(21)

where Wv represents a diagonal matrix, we can easily obtain Wv−1 . 2)Step 2. Solving Gv and F , when Sv and dv are fixed. In order to solve F , we have

min F

M X v=1

= min F

dv ||Xv − F Sv GTv ||2F

n M X X v=1 i=1

= min F

dv ||xvi. − fi. Sv GTv ||22

(22)

n X M X ( dv ||xvi. − fi. Sv GTv ||22 ). i=1 v=1

The above problem can be solved by decoupling the data and assigning the cluster indicator for them one by one independently. That is, we can decompose the optimization problem into n simple subproblems. For each i(1 ≤ i ≤ n), we have min F

M X v=1

dv ||xvi. − fi. Sv GTv ||22 .

(23)

In vector fi. , there is only one element equal to 1 and the remaining elements are zeros.

13

Hence, the solution can be determined by

fij =

  1

j = arg min k

 0

M P

v=1

v 2 dv ||xvi. − rk. | |2

(24)

otherwise.

v where Rv = Sv GTv , rk. is the k-th (k = 1, 2, . . . , c) row of Rv .

Similarly, in order to solve Gv , we have

min Gv

dv ||Xv − F Sv GTv ||2F

v=1

= min Gv

M X

mv M X X v=1 j=1

dv ||xv·j



(25)

v T 2 F Sv (gj· ) ||2 .

The above optimization problem can be decomposed into several simple subproblems. For each j(1 ≤ j ≤ mv ), we have v T 2 ) ||2 . min dv ||xv.j − Pv (gj. Gv

180

(26)

where Pv = F Sv . Therefore, the solution of Eq.(26) can be determined by

v gij =

  1  0

j = arg min dv |xv.j − pv.k |22 k

otherwise.

where pv.k represents the k-th column of Pv . 3)Step 3. Solving dv when F , Sv and Gv are fixed. The procedures of solving Eq.(14) can be summarized in Algorithm 1.

14

(27)

Algorithm 1 Solving FMVBKM via re-weighted method. Input: Data matrix Xv = {X1 , X2 , . . . , XM } ∈ Rn×mv , the maximum number of iterations t = 30. Initialization: Initialize F ∈ Φn×c and Gv ∈ Φmv ×c with arbitrary class indicator 1 for v = 1, 2, . . . , M ; matrices, Set dv = M for iter = 1 : t do − Update Sv by Eq.(21); − Update F by Eq.(24); − Update Gv by Eq.(27); − Update the v-th view weight dv by dv = 2||Xv −F1Sv GT ||F . v − Convergence checking . end for Output: Indicator matrices F for the clustering results of samples and Gv for the v-th view clustering results of features.

5. The Proposed Fast Multi-view Matrix Tri-Factorization (FMVMTF) Approach 185

and the Optimization Algorithm 5.1. The Proposed FMVMTF Approach From the Eq.(14), it can be readily seen that it is a low-rank matrix approximate problem for multi-view data. Moreover, matrix F gives clustering results of samples and matrix Gv gives the clustering results of features for v-th view. Despite Eq.(14) is elegance, the diagonal constraint for Sv which connects the matrix F and Gv is restrictive, it results in information loss and cannot deserve a good low-rank matrix approximation. Therefore, in order to tackle the difficulty, the diagonal constraint is relaxed. In addition, considering that each view exists different importance to clustering results, the ideal view weight can be allocated automatically without explicit weight definition. Furthermore, in order to make the proposed model improve clustering accuracy as well as computational efficiency, we can minimize the following objective function: min

F,Sv ,Gv

M X v=1

s.t.F ∈ Φ

||Xv − F Sv GTv ||F

n×c

, Gv ∈ Φ

15

mv ×c

.

(28)

which is the proposed method termed as Fast Multi-view Matrix Tri-Factorization (FMVMTF) approach. 5.2. Optimization Algorithm of Solving Problem(28) Again, we can use the re-weighted method to solve the above problem. Using the same trick, the Eq.(28) can be transformed into the following problem:

J=

min

F,Gv ,Sv ,dv

M X {dv T r{(Xv − F Sv GTv )T (Xv − F Sv GTv )}} v=1

(29)

s.t.F ∈ Φn×c , Gv ∈ Φmv ×c ,

where dv is the v-th view weight. It can be defined as follows: dv =

1 . 2||Xv − F Sv GTv ||F

(30)

1)Step1. Solving Sv , when F , Gv and dv are fixed. We calculate the partial derivative of Eq.(28) with respect to Sv and set it as 0, we have Sv = (F T F )−1 (F T Xv Gv )(GTv Gv )−1 .

(31)

2)Step2. Fixing Sv and dv , the problem of updating F and Gv is similar to Eq.(24) v and pv.k are the kand Eq.(27), respectively. Let Rv = Sv GTv and Pv = F Sv , rk.

th (k = 1, 2, . . . , c) row of Rv and column of Pv , respectively. Thus, the solution can be obtained by

fij =

  1  0

v gij =

190

  1  0

j = arg min k

M P

v=1

v 2 dv ||xvi. − rk. | |2

(32)

otherwise. j = arg min dv |xv.j − pv.k |22 k

(33)

otherwise.

3)Step3. Fixing F , Sv and Gv , updating dv by Eq.(30). Similarly, the procedures of solving Eq.(28) can be summarized in Algorithm 2. 16

Algorithm 2 Solving FMVMTF via re-weighted method. Input: Data matrix Xv = {X1 , X2 , . . . , XM } ∈ Rn×mv , the maximum number of iterations t = 30. Initialization: Initialize F ∈ Φn×c and Gv ∈ Φmv ×c with arbitrary class indicator 1 , for v = 1, 2, . . . , M ; matrices. Set dv = M for iter = 1 : t do − Update Sv by Eq.(31); − Update F by Eq.(32); − Update Gv by Eq.(33); − Update the v-th view weight dv by Eq.(30). − Convergence checking . end for Output: Indicator matrices F and Gv . 5.3. Computational Complexity Analysis In this section, we will evaluate the time complexity of the proposed two methods. From the Algorithm 1 and Algorithm 2, we can see that the calculation of com195

plexity is equal to time complexity of solving F , Sv , Gv . Note that we only consider operating multiplication due to its high complexity. Solving Sv is respectively O(c) and O(n) time multiplication in Algorithm 1 and Algorithm 2. Computational complexity of solving F is O(M nmct) in the Algorithm 1 and O(M nmc2 t) in the Algorithm 2, where n and m = max{m1 , m2 , . . . , mv } denote separately the num-

200

ber of samples and features for each view, M represents the number of views and t

is the iteration number. O(nmct) and O(nmc2 t) time multiplication in Algorithm 1 and Algorithm 2 in solving Gv , respectively. So the overall time cost is approximately O(c + M nmct + nmct) for FMVBKM and O(n + M nmc2 t + nmc2 t) for FMVMTF. In many applications, it is usually M  n, m  n and c  n, so the 205

complexity of proposed two methods are O(n). 5.4. Convergence Analysis

Since Eq.(14) and Eq.(28) are convex, the problems can converge to global solutions, respectively. Taking the Algorithm 1 for example, the convergence can be proved as follows. To prove the convergence, the Lemma 1 is introduced. Lemma 1 [36] Given two positive numbers a and b, the following inequality holds:

17

a− 210

a2 b2 ≤b− . 2b 2b

(34)

Theorem 1 In Algorithm 1, all updated variables will make the objective value decrease in each iteration until the Algorithm converges. ˜ v respectively represent updated F , Sv and Gv in iteration. Proof: Let F˜ , S˜v and G ˜ v denote the optimal solution, we can get the following inequality Since the F˜ , S˜v and G based on dv : V V X X ˜ Tv ||2 ||Xv − F Sv GTv ||2F ||Xv − F˜ S˜v G F ≤ 2||Xv − F Sv GTv ||F 2||Xv − F Sv GTv ||F v=1 v=1

(35)

In addition, according to Lemma 1, we obtain V X ˜ ˜ ˜T 2 ˜ Tv ||F − ||Xv − F Sv Gv ||F ) ≤ (||Xv − F˜ S˜v G 2||Xv − F Sv GTv ||F v=1

V X ||Xv − F Sv GTv ||2F (||Xv − F Sv GTv ||F − ). 2||Xv − F Sv GTv ||F v=1

(36)

Summing the above two inequations on two sides, we can get V V X X ˜ Tv ||F ) ≤ (||Xv − F˜ S˜v G (||Xv − F Sv GTv ||F ). v=1

(37)

v=1

Considering that the objective function(14) exists a lower bound greater than 0, therefore, the optimization algorithm is convergent. Similarly, the Algorithm 2 can also be proved to be convergent.

215

6. Experiments In this section, we will conduct several experiments to evaluate the performance of the proposed methods on six benchmark data sets which are measured by three standard clustering evaluation metrics. There are Clustering Accuracy (ACC) [37], Normalized Mutual Information (NMI) [37] and Purity [38].

18

220

6.1. Data Set Descriptions The detailed summarization of six data sets that we will use in our experiments are shown in Table 1. Caltech101 This collection consists of 8677 images belonging to 101 categories for object recognition problem. Following previous work [39], we chose the widely used

225

7 classes, i.e.Face, Motorbike, Dolla-Bill, Garfield, Snoopy, Stop-sign and Windsorchair. Totally, we get 1474 images by sampling the data, which is called Caltech1017 (Cal101-7). Moreover, we also chose a data which contains totally 2386 images belonging to 20 classes, i.e.Face, Motorbike, Dolla-Bill, Garfield, Snoopy, Leopards, Binocular, Brain, Camera, Car-Side, Ferry, Hedgehog, Pagoda, Rhino, Stapler, Stop-

230

Sign, Water-Lilly, Windsor-Chair, Wrench and Yin-yang. It is termed as Caltech10120 (Cal101-20). Six visual feature vectors are extracted from each image and contain: Gabor feature with dimension 48, Wavelet moments (WM) with dimension 40, CENTRIST feature with dimension 254, HOG feature with dimension 1984, GIST feature with dimension 512 and LBP feature with dimension 928.

235

Handwritten (HW) This set consists of 2000 patterns(200 data points per class from 0 to 9) and it is from UCI machine learning respository. Six different feature vectors of these digits are used and include 76 Fourier coefficients of the character shapes (FOU), 216 profile correlations (FAC), 64 Karhunen-Love coefficients (KAR), 240 pixel averages in 2 × 3 windows (PIX), 47 Zernike moment (ZER) and 6 morpho-

240

logical (MOR) features.

MNIST The data set of handwritten digits (0∼9) contains 10000 test samples from Yann LeCun’ MNIST page [40]. We use three heterogeneous feature vectors for all images: Isometric projection (ISO) with dimension 30, Linear discriminant analysis (LDA) with dimension 9 and Neighborhood preserving embedding (NPE) with 245

dimension 30. NUS-WIDE The data set totally consists of 81 concepts, 269,648 images about animal concept [41]. We select 12 categories and totally 2400 data samples which compose of cat, cow, dog, elk, hawk, horse, lion, squirrel, tiger, whales, wolf and zebra. Six types of low-level feature vectors are used for each image: 64 color his-

250

togram (CH), 144 color correlogram (CC), 73 edge direction histogram (EDH), 128 19

Table 1: Descriptions of data sets

View 1 2 3 4 5 6 Size Classes

Cal101-7(20) GABOR(48) WM(40) CENT(254) HOG(1984) GIST(512) LBP(928) 1474(2386) 7(20)

HW FOU(76) FAC(216) KAR(64) PIX(240) ZER(47) MOR(6) 2000 10

MNIST ISO(30) LDA(9) NPE(30) 10000 10

NUS CH(64) CC(144) EDH(73) WAV(128) BCM(255) SIFT(500) 2400 12

AWA CQ(2688) LSS(2000) PHOG(252) SIFT(2000) RGSIFT(2000) SURF(2000) 4000 50

wavelet texture (WAV), 255 block-wise color moment (BCM) and 500 bags of words based on SIFT descriptions (SIFT). Animal with attributes (AWA) The animal images contain 50 kinds of animals. For each class, we select the front 80 images and get totally 4000 images. Six different 255

feature vectors are extracted to represent each image: Color Histogram (CH), Local Self-Similarity (LSS), Pyramid HOG (PHOG), SIFT, Color SIFT (RGSIFT) and SURF with dimensions 2688, 2000, 252, 2000, 2000 and 2000, respectively. 6.2. Comparison Scheme Firstly, in order to demonstrate more advantages of the multi-view clustering meth-

260

ods than single-view approaches, we compare the proposed two approaches with BKM algorithm [33] for each view. Secondly, we make comparison with several state-of-theart methods: Co-trained Spectral Clustering (ContrainSC) [16], Co-regularized Spectral Clustering (CoregSC) [15], Multi-view Spectral Clustering (MVSC) [42], Autoweighted Multiple Graph Learning (AMGL) [18], Multi-view Learning with Adaptive

265

Neighbors (MLAN) [17], Self-weighted Multi-view Clustering (SwMC) [19] and Robust Multi-view K-means Clustering (RMVKM) [23]. In co-clustering methods, we set the clustering number of samples as the true number of classes which is equal to the number of feature clustering. What is more, we run the compared approaches based on the optimal parameters setting of the corresponding

270

papers. In order to make the experiments compare fairly, all data sets are normalized in the range of [0, 1] before doing anything. In experimental results, the best results and

20

Table 2: Comparison of the proposed methods with BKM algorithm on Cal101-7 and Cal101-20 data sets

Method BKM(1) BKM(2) BKM(3) BKM(4) BKM(5) BKM(6) FMVBKM FMVMTF

ACC 0.4946 0.4498 0.5312 0.5638 0.5265 0.5746 0.7090 0.7368

Cal101-7 NMI 0.1384 0.1025 0.0147 0.0667 0.2725 0.0969 0.4865 0.6686

Purity 0.6350 0.5753 0.5427 0.5645 0.7273 0.5746 0.8250 0.8426

ACC 0.3466 0.3998 0.3407 0.3433 0.3437 0.3898 0.5754 0.6894

Cal101-20 NMI 0.1165 0.1196 0.0237 0.0209 0.0199 0.0366 0.4535 0.5785

Purity 0.4292 0.4237 0.3407 0.3462 0.3453 0.8980 0.6500 0.6894

Table 3: Comparison of the proposed methods with BKM algorithm on HW and NUS data sets

Method BKM(1) BKM(2) BKM(3) BKM(4) BKM(5) BKM(6) FMVBKM FMVMTF

ACC 0.2433 0.2892 0.4825 0.4780 0.4255 0.3875 0.6038 0.9405

HW NMI 0.0288 0.0744 0.3978 0.4625 0.3448 0.4547 0.6425 0.8754

Purity 0.2458 0.2933 0.5030 0.5065 0.4255 0.3875 0.6165 0.9405

ACC 0.1792 0.1675 0.1400 0.1546 0.1717 0.1283 0.2529 0.2583

NUS NMI 0.0507 0.0521 0.0441 0.0356 0.0519 0.0515 0.1022 0.1296

Purity 0.1858 0.1704 0.1492 0.1583 0.1750 0.1375 0.2529 02729

second results are marked in bold and underline, respectively. 6.3. Experimental Results and Analysis To verify the importance of intercoordination among multiple views, Table 2, Ta275

ble 3 and Table 4 show the clustering performance based on BKM algorithm (for each view) and the proposed two methods on six data sets. From the comparison results, we can see that multi-view clustering has more significant advantages and achieve better clustering results than single view since it considers the relationships between heterogeneous representations. Specifically, for multi-view data sets, each view may include

280

some different information. For Cal101-7, HW, NUS and MNIST data sets, we can observe that the proposed FMVMTF method shows the best clustering results and the proposed FMVBKM approach obtains the sub-optimal clustering results. This is not surprising since FMVMTF 21

Table 4: Comparison of the proposed methods with BKM algorithm on AWA and MNIST data sets

Method BKM(1) BKM(2) BKM(3) BKM(4) BKM(5) BKM(6) FMVBKM FMVMTF

ACC 0.0370 0.0445 0.0205 0.0418 0.0460 0.1985 0.1053 0.1218

AWA NMI 0.0286 0.0324 0.0015 0.0324 0.0305 0.1773 0.1520 0.2015

Purity 0.0375 0.0445 0.0205 0.0418 0.0460 0.1985 0.1075 0.1278

ACC 0.2926 0.4990 0.2978 0.5887 0.8023

MNIST NMI 0.1854 0.3984 0.1745 0.4480 0.6586

Purity 0.2970 0.4990 0.3081 0.5950 0.8023

clustering approach can make low-rank approximation more accuracy by improving 285

degrees of freedom. However, the FMVBKM performance on HW data set has a little higher performance comparing with the single view clustering due to the lower relationships of the row clustering and the column clustering. Moreover, it is worth noting that, although the FMVBKM approach shows the sub-optimal clustering results for MNIST data set, it is approximate performance with the single-view clustering method.

290

Therefore, we can conclude the FMVBKM approach is not suitable for digital data sets because of lack of duality. Table 5, Table 6 and Table 7 show the clustering results about the proposed two approaches and the compared multi-view clustering algorithms. From the summarized clustering results, we can observe that the proposed FMVMTF method almost achieves

295

the better or comparable results than all compared methods. The main reason is that co-clustering technique can make the samples and features cluster for multi-view data sets, simultaneously. Although the proposed FMVBKM algorithm performs the similar clustering results with the compared approaches, it can enhance the efficiency. In addition, the proposed methods can adaptively learn the weight for each view and have no

300

parameter to set. Further, for the AWA data set which contains 50 classes with 80 samples for each class, the performance of proposed methods exceeds the other compared methods. This is not surprising due to the widely accepted hypothesis that clustering of features is helpful to the clustering of samples. As can be seen, the performance of FMVMTF algorithm is always better than FMVBKM algorithm, which demonstrates

22

Table 5: Clustering performance comparison on Cal101-7 and Cal101-20 data sets

Method ContrainSC CoregSC MVSC AMGL MLAN SwMC RMVKM FMVBKM FMVMTF

ACC 0.4410 0.4735 0.5550 0.6615 0.6581 0.6486 0.4593 0.7090 0.7368

Cal101-7 NMI 0.4166 0.3623 0.4304 0.5614 0.4448 0.5597 0.3166 0.4865 0.6686

Purity 0.7901 0.7883 0.8358 0.8426 0.8304 0.8414 0.8141 0.8250 0.8521

ACC 0.3558 0.4057 0.4954 0.5268 0.4489 0.4195 0.3043 0.5754 0.6957

Cal101-20 NMI 0.4762 0.5157 0.5152 0.5180 0.3695 0.3084 0.3520 0.4535 0.5785

Purity 0.5024 0.4994 0.6894 0.6421 0.6169 0.5126 0.6081 0.6500 0.6895

Table 6: Clustering performance comparison on HW and NUS data sets

Method ContrainSC CoregSC MVSC AMGL MLAN SwMC RMVKM FMVBKM FMVMTF

305

ACC 0.5865 0.8420 0.2945 0.8460 0.6830 0.8450 0.3825 0.6038 0.9405

HW NMI 0.5816 0.7769 0.2208 0.8732 0.7255 0.8942 0.4484 0.6425 0.8754

Purity 0.6275 0.8420 0.2985 0.8710 0.7275 0.8820 0.4250 0.6165 0.9405

ACC 0.2696 0.2583 0.2054 0.1988 0.1183 0.1333 0.2183 0.2529 0.2729

NUS NMI 0.1379 0.1286 0.1088 0.0985 0.0395 0.0581 0.1203 0.1022 0.1296

Purity 0.2704 0.2719 0.2246 0.2163 0.1225 0.1437 0.2425 0.2529 0.2808

that low-rank matrix decomposition is more accurate by improving degrees of freedom. Thus, comparing with the state-of-the-art multi-view clustering algorithms, the proposed methods show more strong practicability without needing to preset parameters. 6.4. Studies of view weight

310

Observing the proposed two approaches, the view weights are inversely proportional to the value of corresponding objective function. Therefore, in solving process, the view weight raises as the decrease of corresponding objective function value. To show the view weights are convergent, we give the convergence curve for each view about all data sets. They are shown in Figure 1 and Figure 2, respectively.

23

Table 7: Clustering performance comparison on AWA and MNIST data sets

Method ContrainSC CoregSC MVSC AMGL MLAN SwMC RMVKM FMVBKM FMVMTF

ACC 0.1218 0.1225 0.1133 0.0488 0.0573 0.0283 0.1195 0.1053 0.1258

AWA NMI 0.1970 0.2005 0.1702 0.0535 0.0720 0.0334 0.1852 0.1520 0.2015

Purity 0.1278 0.1335 0.1235 0.0580 0.0648 0.0393 0.1285 0.1075 0.1383

ACC 0.5695 0.5461 0.5555 0.7662 0.5564 0.7581 0.5462 0.5787 0.8023

MNIST NMI 0.5326 0.4868 0.5889 0.6664 0.5978 0.7099 0.4835 0.4480 0.6586

Table 8: The mutual information matrix about Cal101-7 data set

Cal101-7 #v1 #v2 #v3 #v4 #v5 #v6

315

#v1 1 0.6124 0.3090 0.1235 0.2202 0.1470

#v2 0.6124 1 0.3308 0.1273 0.2375 0.1632

#v3 0.3090 0.3308 1 0.2788 0.4814 0.3371

#v4 0.1235 0.1273 0.2788 1 0.4431 0.5452

#v5 0.2202 0.2375 0.4814 0.4431 1 0.5504

Purity 0.6182 0.5601 0.6146 0.7662 0.6089 0.7581 0.5783 0.5950 0.8023

#v6 0.1470 0.1632 0.3371 0.5452 0.5504 1

Figure 1 shows the view weights about FMVBKM algorithm in each iteration. First, for each data set, the initial view weights are given except for NUS data set since the updated weights are far away from the initial weights. For the experiment, suppose the data set contains V views, the initial weights are assigned to 1/V . From the Cal101-7, Cal101-20, AWA and MNIST data sets, we can see that the view weights

320

firstly decrease due to the fact that the first iteration view weights are away from the corresponding initial values. Then, the view weights rise as the relevant objective function value decreases. When the objective value reaches to be minimum, the view weights achieve the optimal solution. Figure 2 presents the view weights about FMVMTF algorithm. The initial view weights are same as the FMVBKM algorithm. Due to the fact

325

that the initial values and updated weight values exist enormous deviation, the weight curves are shown without initial values. Observing the Figure 2, the presented trend is similar to the Figure 1. Here, we do not explain in detail. To show the relationships among views, following the reference [43], the redun-

24

0.18

0.25 0.16

0.16 View 1 View 2 View 3 View 4 View 5 View 6

0.14 0.12 0.1 0.08

0.14

0.2 View 1 View 2 View 3 View 4 View 5 View 6

0.12 0.1

View 1 View 2 View 3 View 4 View 5 View 6

0.15

0.08 0.1

0.06

0.06

0.04

0.05

0.04

0.02 0.02 0

0 5

10

15

20

25

30

5

(a) Cal101 − 7

10

15

20

25

30

5

(b) Cal101 − 20

10

15

20

25

20

25

30

(c) HW

10-3

1.4

0.16 0.3

1.3 0.14 1.2

View 1 View 2 View 3 View 4 View 5

1.1 1

0.25 0.12 0.2

0.1 View 1 View 2 View 3 View 4 View 5 View 6

0.08 0.9 0.06 0.8 0.04

0.15

0.1

0.05

0.7

View 1 View 2 View 3

0.02

0.6 0

5

10

15

20

25

30

(d) N U S

5

10

15

20

25

30

(e) AW A

5

10

15

30

(f) M N IST

Figure 1: Convergence curve of the proposed FMVBKM approach about view weights on six

benchmark data sets.(a)Cal101-7 (b)Cal101-20 (c)HW (d)NUS (e)AWA (f)MNIST. Table 9: The mutual information matrix about Cal101-20 data set

Cal101-20 #v1 #v2 #v3 #v4 #v5 #v6

#v1 1 0.6202 0.3094 0.1234 0.2185 0.1470

#v2 0.6202 1 0.3346 0.1279 0.2372 0.1643

#v3 0.3094 0.3346 1 0.2821 0.4813 0.3394

#v4 0.1234 0.1279 0.2821 1 0.4445 0.5481

#v5 0.2185 0.2372 0.4813 0.4445 1 0.5486

#v6 0.1470 0.1643 0.3394 0.5481 0.5486 1

dancy can be measured by their mutual information. Firstly, each sample is regarded 330

as a variable in every view; Then, the mutual information between arbitrary two views for each sample is computed; Finally, the mean mutual information for all samples is used to measure the relationships between two views. We select four data sets to measure the mutual information among views. They are shown in Table 8-Table 11 about Cal101-7, Cal101-20, HW and MNIST data sets, respectively. Taking the Cal101-7

335

and Cal101-20 data sets for examples, we can see that the mutual information about the first and the second view is larger than any two views. The result shows that the two views exist strong relationships and uniform clustering performance. Similarly, from

25

10-3

10-3

4.5

0.5

3.5

0.45

4 3

0.4

3.5 View 1 View 2 View 3 View 4 View 5 View 6

3 2.5 2

2.5

View 1 View 2 View 3 View 4 View 5 View 6

2

1.5

0.35 0.3 0.25

View 1 View 2 View 3 View 4 View 5 View 6

0.2

1.5

0.15

1 1

0.1 0.5

0.5

0.05

0

0 0

5

10

15

20

25

30

0 0

(a) Cal101 − 7

10

15

20

25

30

0

5

10

(b) Cal101 − 20

10-3

1.5

5

15

20

25

30

(c) HW

0.035

0.24

1.4

0.22 View 1 View 2 View 3 View 4 View 5

1.3 1.2

0.03 0.2 0.025 0.18 View 1 View 2 View 3 View 4 View 5 View 6

1.1 0.02 1 0.015

0.9

0.16 0.14 0.12

0.8

View 1 View 2 View 3

0.01 0.1

0.7 0.6

0.005 0

5

10

15

20

25

30

0.08 0

(d) N U S

5

10

15

20

25

30

0

5

(e) AW A

10

15

20

25

30

(f) M N IST

Figure 2: Convergence curve of the proposed FMVMTF approach about view weights on six

benchmark data sets.(a)Cal101-7 (b)Cal101-20 (c)HW (d)NUS (e)AWA (f)MNIST.

HW #v1 #v2 #v3 #v4 #v5 #v6

Table 10: The mutual information matrix about HW data set

#v1 1 0.4407 0.6176 0.2060 0.5218 0.2518

#v2 0.4407 1 0.4110 0.2356 0.3318 0.1709

#v3 0.6176 0.4110 1 0.1973 0.5535 0.2731

#v4 0.2060 0.2356 0.1973 1 0.1673 0.0716

#v5 0.5218 0.3318 0.5535 0.1673 1 0.2244

#v6 0.2518 0.1709 0.2731 0.0716 0.2244 1

the Figure1(a)-(b) and Figure2(a)-(b), we can also see that the first view and second view can be assigned similar view weights. Moreover, observing the mutual infor340

mation for MNIST data set in Table 11, the first and third view are strong relevant. Similarly, from the Figure 1(f) and Figure 2(f), we can also see that the first view and third view exist similar importance to clustering result. In addition, the second view is assigned to higher view weight comparing with the other two views for the MNIST data set. Further, the experiment about the second single view for proposed method is

345

done, but it is not shown due to the space limitation. Observing the experimental result, the second view plays an important role in clustering. In a whole, the mutual informa-

26

Table 11: The mutual information matrix about MNIST data set

Mnist #1 #2 #3

#1 1 0.5185 0.7282

#2 0.5185 1 0.5144

#3 0.7282 0.5144 1

tion value between two views is larger, they exist stronger relationship. Therefore, the two views exist similar importance for clustering results. 6.5. Selection of the feature clustering number 350

In all co-clustering methods, the clustering numbers about samples and features are generally set to be equal. Since the number of cluster is smaller than the number of samples and features, the proposed two methods can be regarded as low-rank approximation. According to the low-rank theory, the ranks about indicator matrices F and G are equal or less than the number of cluster. Moreover, it is clear that the number

355

of maximum rank depends on the number of feature clusters for indicator matrix G. To show how the rank approximation affects the co-clustering accuracy, we respectively set different numbers of feature clustering. The co-clustering results are shown in Figure 3. Observing the Figure 3, for the HW, NUS, AWA and MNIST data sets, the cluster-

360

ing performance becomes better as the improvement of feature clustering number. For the Cal101-7 and Cal101-20 data sets, although the results appear a little fluctuation, the overall trend gradually raises. Based on these performance, we can conclude that the co-clustering accuracy is influenced by the number of the maximum rank about features. These phenomenon appears due to the fact that as more the feature clustering

365

number, the more useful information. 6.6. Studies of Computational Complexity Table 12 shows the time complexity of all methods on six benchmark data sets. We perform 10 times experiments and record the average running time for all methods on all data sets. It is clear that the proposed FMVBKM algorithm is faster than other

370

compared methods in almost all cases. Especially, for the MNIST data set, due to the

27

90

70

95

80

90 60

70

85 50

50

40

Measure value (%)

80

Measure value (%)

Measure value (%)

60

40

30

30

75

70

65 20

20

60 10

10

55 ACC NMI Purity

0 1

2

3

4

5

6

ACC NMI Purity

0 7

0

1

4

Clustering number about features

(a) Cal101 − 7

10

13

16

1

80

18

70

20

16

60

14

10

12

5

10 ACC NMI Purity

0 4

7

Clustering number about features

(d) N U S

10

12

Measure value (%)

20

15

5

7

10

(c) HW

25

1

3

Clustering number about features

30

0

ACC NMI Purity

50

20

(b) Cal101 − 20

Measure value (%)

Measure value (%)

7

Clustering number about features

50

40

30 ACC NMI Purity

8 01

10

20

30

Clustering number about features

(e) AW A

40

50

ACC NMI Purity

20 1

3

5

7

10

Clustering number about features

(f) M N IST

Figure 3: Clustering performance of the proposed FMVMTF approach about different fea-

ture clustering numbers on six benchmark data sets.(a)Cal101-7 (b)Cal101-20 (c)HW (d)NUS (e)AWA (f)MNIST.

less matrix multiplication for the proposed FMVBKM algorithm in each iteration, the clustering process can be speeded up a lot. However, for AWA data set, running time of the proposed two methods is not fast due to the fact that the time cost is proportional to the number of clustering. Although, the proposed FMVMTF algorithm is a little 375

slower than the FMVBKM algorithm, the clustering results have huge improvement. What is more, Figure 4 shows the convergence curve for the proposed FMVMTF algorithm on all data sets. we can observe that the proposed FMVMTF algorithm can reach convergency with fewer iteration steps [40]. Therefore, the proposed FMVMTF algorithm not only has effective clustering results, but also has low computational com-

380

plexity. The reason is that the computational complexity of matrix factorization is O(n3 ), the proposed FMVMTF algorithm has high computational efficiency own to the introduction of indicator matrices. The effective clustering results are based on two terms. The first is the increased degrees of freedom which can connects the sample indictor matrix with feature indicator matrix; The second is that the row clustering 28

Table 12: Run time (in seconds) comparison of different approaches on six multi-view data sets. (The ” − ” denotes time cost exceeds one hour. )

Method ContrainSC CoregSC MVSC AMGL MLAN SwMC RMVKM FMVMTF FMVBKM

385

Cal101-7 26.37 13.18 17.71 23.14 6.20 79.61 16.08 7.68 5.93

Cal101-20 101.58 42.44 63.92 85.62 27.48 314.28 42.61 29.20 27.20

HW 55.61 24.82 42.81 65.11 7.59 235.74 15.94 5.00 4.95

NUS 88.32 38.11 73.03 170.66 19.19 106.98 23.97 7.89 7.73

AWA 411.11 197.71 317.88 136.31 200.52 240.12 240.09 297.97 283.86

MNIST 1978.24 839.39 29.32 864.09 239.85 119.57 9.29 7.79

induces the column clustering while the column clustering induces row clustering. 7. Conclusion In this work, to simultaneously cluster both the set of samples and the set of features, we firstly propose a Fast Multi-view Bilateral K-means (FMVBKM) method. It adopts main idea of BKM, and adaptively controls the intercoordinations among multi-

390

ple views in a re-weighted manner. It can be readily found that the FMVBKM method is actually a special matrix decomposition problem. It can also be termed as a low-rank matrix approximation problem with the constraints of indicator matrix and diagonal matrix. These constraints make the solution involve less multiplications. Therefore, the proposed method has high computational efficiency. Moreover, the diagonal con-

395

straint in matrix decomposition leads to rather poor low-rank approximation. In order to reduce information loss, we relax FMVBKM model by providing increased degrees of freedom which can make low-rank approximation remain accurate, named as Fast Multi-view Matrix Tri-Factorization (FMVMTF) approach. The proposed methods are conducted in six benchmark data sets and the experimental results show the effec-

400

tiveness of the proposed methods. Further, the experiment of computing complexity on multi-view data sets show that the proposed FMVBKM method has a little higher computational efficiency than the FMVMTF approach. From the experimental results, we can conclude that the FMVMTF method mainly improves the clustering results and the FMVBKM approach primarily enhances the computational efficiency. In the fu29

5800

4500 4400 4300 4200 4100 4000

2500

Objective Function Values

6000

4600

Objective Function Values

Objective Function Values

4700

5600 5400 5200 5000 4800

0

5

10

15

20

25

30

5

10

15

20

25

30

1.46

3400 3350 3300 3250 3200 3150

5

25

30

The Number of Iterations

(d) N U S

10

15

20

25

30

25

30

The Number of Iterations

(c) HW

104

1350

1.44 1.42 1.4 1.38 1.36

20

0

Objective Function Values

3450

Objective Function Values

Objective Function Values

1.48

15

2100

(b) Cal101 − 20

3500

10

2200

The Number of Iterations

(a) Cal101 − 7

5

2300

2000 0

The Number of Iterations

0

2400

1300

1250

1200

1150 0

5

10

15

20

25

30

The Number of Iterations

(e) AW A

0

5

10

15

20

The Number of Iterations

(f) M N IST

Figure 4: Convergence curve of the proposed FMVMTF method on six benchmark data

sets.(a)Cal101-7 (b)Cal101-20 (c)HW (d)NUS (e)AWA (f)MNIST.

405

ture, we will design a multi-view clustering method which is not only effective but also efficient. Acknowledgement This work was supported in part by the National Natural Science Foundation of China grant under number 61772427 and 61751202.

410


Author Biography

Feiping Nie received the Ph.D. degree in Computer Science from Tsinghua University, China, in 2009, and is currently a full professor at Northwestern Polytechnical University, China. His research interests are machine learning and its applications, such as pattern recognition, data mining, computer vision, image processing, and information retrieval. He has published more than 100 papers in journals and conferences including TPAMI, IJCV, TIP, TNNLS, TKDE, ICML, NIPS, KDD, IJCAI, AAAI, ICCV, CVPR, and ACM MM. His papers have been cited more than 10000 times, and his H-index is 57. He serves as an Associate Editor or PC member for several prestigious journals and conferences in related fields.

Shaojun Shi is currently working toward the Ph.D. degree in the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, Shaanxi, China. Her research interests include data mining and machine learning.

Xuelong Li is a full professor with the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, P.R. China. He is a Fellow of the IEEE.


Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.