Multi-View Clustering via Clusterwise Weights Learning

Qianli Zhao, Linlin Zong∗, Xianchao Zhang, Xinyue Liu, Hong Yu

Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, School of Software, Dalian University of Technology, Dalian, China, 116620

Abstract

It is known that the performance of multi-view clustering can be improved by assigning weights to the views, since different views play different roles in the final clustering results. Nevertheless, we observe that the weights can be further refined, since in reality different clusters also have different impacts on finding the correct results. We propose a multi-view clustering algorithm with clusterwise weights (MCW), which assigns a weight to each cluster within each view. The objective function of MCW consists of three parts: (1) intra-view clustering: clustering each view by using non-negative matrix factorization; (2) inter-view relationship learning: learning the consensus clustering result by a weighted combination of the views; (3) clusterwise weight learning: learning the weight of a cluster by making the weight proportional to the average distance between the cluster and the other clusters. We present an effective alternating algorithm to solve the non-convex optimization problem. Experimental results on several benchmark datasets demonstrate the superiority of the proposed algorithm over existing multi-view clustering methods.

Keywords: Multi-view, Non-negative matrix factorization, Weight.

∗ Corresponding author. Email address: [email protected] (Linlin Zong)

Figure 1: Two-view data with three clusters (view 1 and view 2). Instances with the same color belong to the same latent cluster; each circle denotes the region where most of the instances within a cluster are located.

1. Introduction

Traditional clustering methods focus on datasets with a single view. However, one instance can often be observed from different views. For example, a web page can be represented by the full text on the page; it can also be represented by the anchor text attached to the hyperlinks pointing to it. By describing the dataset more completely, multiple views provide complementary information that facilitates learning tasks. Multi-view clustering, which groups similar samples into the same cluster and dissimilar samples into different clusters by mining the multi-view features simultaneously, has attracted increasing attention in the past decade.

The early multi-view clustering algorithms treat all the views equally. However, it is more rational to assign different weights to the views [1]. Recently, several clustering algorithms have been proposed to discriminate between views. A series of algorithms [1][2][3][4] use an algebraic method to minimize the objective function over all the views. There are also works [5][6] that allocate a weight to each view automatically, without additional weight and penalty parameters. Zong et al. [7] proposed a weighted multi-view spectral clustering algorithm based on spectral perturbation theory. Several algorithms [8][9][10][11] aggregate multiple affinity matrices into a shared affinity matrix.

Note that previous studies on weighted multi-view clustering assign only one weight to each view. However, in some applications it is often the case that the distributions of the clusters vary from view to view. In other words, a cluster may be easily detected in one view but hard to find in another. For example, Figure 1 shows typical two-view data with three clusters. In view 1, it is easy to find the blue cluster, but it is difficult to separate the red cluster from the green one. In view 2, the green cluster is easily separable, but the red and the blue ones are blended. In this case, if we allocate only one weight to each view and ignore the different impacts of different clusters, the negative information that "the red and green points may be merged into the same cluster" in view 1 may be transferred to view 2. Similar negative information also occurs in view 2 with the red and blue points. After the negative information interacts between the two views, all three clusters become hard to distinguish from each other.

In this paper, we propose a multi-view clustering algorithm via clusterwise weights learning (MCW), which assigns a weight to each cluster of each view. The overall objective function of MCW consists of three parts. (1) Intra-view clustering: we find the intra-view clusters and update them with the knowledge transferred from the other views. We adopt non-negative matrix factorization (NMF) [12] to find the intra-view clusters directly from the coefficient matrix. (2) Inter-view relationship learning: the algorithm learns the consensus clustering result, or consensus partition, by a weighted combination of the information from each view. The foundation is that the partitions of all views are close to the consensus partition, and each cluster in each view contributes to the consensus clusters separately. (3) Clusterwise weight learning: we learn the weight of each cluster within each view. The intuition is that a cluster is separable if its centroid is far from the centroids of all other clusters; thus, the weight is proportional to the average distance between a cluster and the other clusters. We further present an effective alternating algorithm to solve the non-convex optimization problem.

The contributions of this work are summarized as follows:

• Even though multi-view clustering enforces the cluster centers to be as close as possible, existing multi-view clustering methods usually assign a weight to a view rather than to a cluster of each view. Our algorithm learns the weights of the clusters in each view based on the basis vectors. In turn, the weights guide the learning of the basis vectors and coefficient vectors. The whole process makes the centroids of the clusters more separable.

• We explore a better multi-view clustering algorithm that partitions multi-view data utilizing the clusterwise weights. Experimental results show that the proposed algorithm outperforms typical unweighted multi-view clustering algorithms and weighted multi-view clustering algorithms.

2. Related Work

Unweighted multi-view clustering. Based on the basic algorithms they utilize, these multi-view clustering methods can be roughly classified into three categories. The methods in the first category use spectral clustering as the basic algorithm; they learn a consensus similarity matrix or minimize the divergence of multiple views simultaneously [13][14]. The second category takes advantage of NMF; these methods learn a consensus indicator matrix or minimize the divergence of multiple indicator matrices [15][16]. The last category is based on various other techniques [17][18][19][20][21][22].

Weighted multi-view clustering. The algorithm in [1] learns a weighted combination of the kernels in parallel. Under its framework, the algorithm in [3] determines the coefficients by maximizing the kernel alignment between the combined kernel and the kernel approximated from the indicator matrix. The algorithms in [2][4] use the same weighting scheme as [1] to combine multiple non-negative matrix factorizations [12]. The work in [23] weights the data points and the feature representations in each view, respectively. The algorithms in [5][6] allocate a weight to each view automatically without additional weight and penalty parameters. The algorithm in [7] is a weighted multi-view spectral clustering algorithm based on spectral perturbation theory. The work in [24] incorporates the local invariance within each view and the consistency across different views into a novel objective function, where the local invariance is defined by a deep metric learning network rather than the traditional Euclidean distance. The algorithms in [8][9][10][11] aggregate multiple affinity matrices into a single one; they fall under the framework of multiple kernel learning, which is different from (though related to) multi-view clustering.

3. The Proposed Method

3.1. Intra-view Clustering

Multi-view clustering extends single-view clustering, which has been extensively studied. We adopt NMF to find the clusters within each view. Given a dataset with $n$ instances that have representations in $m$ views, the instances are to be partitioned into $k$ clusters. Denote $X^{(v)} = [x_1^{(v)}, x_2^{(v)}, \cdots, x_n^{(v)}] \in \mathbb{R}_+^{d^{(v)} \times n}$ as the data matrix of the $v$-th view, where $d^{(v)}$ is the dimension of the $v$-th view. In the $v$-th view, NMF tries to minimize the following objective function,

$$\|X^{(v)} - U^{(v)} V^{(v)T}\|_F^2 \quad \mathrm{s.t.} \quad U^{(v)} \geq 0,\; V^{(v)} \geq 0, \tag{1}$$

where $U^{(v)} \in \mathbb{R}_+^{d^{(v)} \times k}$ is the basis matrix whose $i$-th column $U_{\cdot i}^{(v)}$ is the basis vector of the $i$-th cluster in the $v$-th view, and $V^{(v)} \in \mathbb{R}_+^{n \times k}$ is the coefficient matrix whose $j$-th row $V_{j \cdot}^{(v)}$ is the coefficient vector of the $j$-th instance in the $v$-th view. Thanks to the non-negativity of NMF, the instance $x_j^{(v)}$ belongs to the $i$-th cluster, where $i$ is the index of the maximum entry of the coefficient vector $V_{j \cdot}^{(v)}$.
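As a concrete illustration of this intra-view step (not part of the original formulation), the following sketch runs NMF on a single view with scikit-learn and reads off hard cluster assignments from the coefficient matrix; the function name and the choice of scikit-learn's NMF solver are our own assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_view_clustering(X_view, k, seed=0):
    """Per-view NMF clustering sketch for Eq. (1): X^(v) ~= U^(v) V^(v)T.

    X_view : (d_v, n) non-negative data matrix of one view.
    Returns the basis matrix U (d_v, k), the coefficient matrix V (n, k)
    and hard labels taken as the argmax of each row of V.
    """
    model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=seed)
    V = model.fit_transform(X_view.T)   # (n, k) coefficient matrix
    U = model.components_.T             # (d_v, k) basis matrix
    labels = V.argmax(axis=1)           # instance j joins its largest-coefficient cluster
    return U, V, labels
```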

3.2. Inter-view Relationship Learning

Inter-view relationship learning learns the consensus partition, which contains the shared knowledge of multiple views. In multi-view clustering, the partitions of all views are close to a consensus partition; thus, the partition in each view contributes to the consensus partition. But as shown in Figure 1, as the distribution of each view varies, the contributions of the clusters are different. Denote $V^*$ as the consensus coefficient matrix and the diagonal matrix $M^{(v)} \in \mathbb{R}^{k \times k}$ as the contribution of the $v$-th view, where $M_{ii}^{(v)}$ is the weight of the $i$-th cluster in the $v$-th view. Since the $j$-th column $V_{\cdot j}^{(v)}$ and the $j$-th column $V_{\cdot j}^*$ indicate the possibility of each instance belonging to the $j$-th cluster, $V^*$ can be a mixture of $V^{(v)}$, $v \in \{1, 2, \ldots, m\}$, with a certain proportion. Mathematically, we learn $V^*$ by $V^{(v)} \approx V^* M^{(v)}$, i.e., we minimize the following equation,

$$\sum_{v=1}^{m} \|V^{(v)} - V^* M^{(v)}\|_F^2 \quad \mathrm{s.t.} \quad V^* \geq 0,\; M^{(v)} \geq 0,\; \sum_{v=1}^{m} M^{(v)} = I. \tag{2}$$

For each cluster, without loss of generality, the weights are non-negative and the sum of its weights over the views is 1. From Eq. (2), we find that $\sum_v V^{(v)} \approx \sum_v V^* M^{(v)} = V^*$, i.e., $V^*$ is a weighted linear combination of the $V^{(v)}$.

By minimizing the difference between the consensus coefficient matrix and the coefficient matrix of each view, the learned consensus coefficient matrix contains the knowledge of multiple views. In turn, the consensus coefficient matrix guides the learning of the coefficient vectors in each view. In this way, knowledge transfers from view to view.
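For illustration only, the following minimal NumPy sketch evaluates the inter-view term of Eq. (2) and forms the weighted mixture interpretation of $V^*$; the helper names and the representation of each $M^{(v)}$ by its diagonal are our own assumptions, not the authors' implementation.

```python
import numpy as np

def interview_loss(V_list, V_star, M_list):
    """Inter-view relationship term of Eq. (2): sum_v ||V^(v) - V* M^(v)||_F^2.

    V_list : list of (n, k) per-view coefficient matrices.
    V_star : (n, k) consensus coefficient matrix.
    M_list : list of (k,) arrays, the diagonals of the clusterwise weights M^(v).
    """
    return sum(np.linalg.norm(V - V_star * M, "fro") ** 2
               for V, M in zip(V_list, M_list))

def mix_views(V_list, M_list):
    """Because sum_v M^(v) = I, the consensus is approximately the clusterwise
    weighted mixture of the views: V* ~= sum_v V^(v) M^(v)."""
    return sum(V * M for V, M in zip(V_list, M_list))
```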

3.3. Clusterwise Weight Learning

This component learns the exact weight of each cluster based on the properties of the weights. In NMF-based clustering, each cluster has a basis vector, which can be seen as the centroid of the cluster. Intuitively, a cluster is separable if its centroid is far from the centroids of all other clusters. As a result, a separable cluster is weighted more in the proposed algorithm. Mathematically, we minimize Eq. (3),

$$\sum_{v=1}^{m} \sum_{i=1}^{k} \left( (1 - M_{ii}^{(v)}) \sum_{j=1}^{k} (U_{\cdot i}^{(v)} - U_{\cdot j}^{(v)})^2 \right). \tag{3}$$

In Eq. (3), $\sum_{j=1}^{k} (U_{\cdot i}^{(v)} - U_{\cdot j}^{(v)})^2$ is the sum of the distances between the centroid of the $i$-th cluster and the centroids of the other clusters in the $v$-th view. If $\sum_{j=1}^{k} (U_{\cdot i}^{(v)} - U_{\cdot j}^{(v)})^2$ is large, cluster $i$ is separable. To minimize Eq. (3), $1 - M_{ii}^{(v)}$ should then be small, so $M_{ii}^{(v)}$ takes a large value. Eq. (3) means that the larger the squared Euclidean distance between $U_{\cdot i}^{(v)}$ and $U_{\cdot j}^{(v)}$, $j \in \{1, 2, \ldots, k\}$, is, the higher the weight of cluster $i$ is.

By expressing Eq. (3) in matrix form, the objective function for learning the fine-grained weights is

$$\sum_{v=1}^{m} \mathrm{tr}\left( U^{(v)} (D^{(v)} + Q^{(v)} - 2P^{(v)}) U^{(v)T} \right), \tag{4}$$

where $P^{(v)} \in \mathbb{R}^{k \times k}$ with $P_{ij}^{(v)} = 1 - M_{ii}^{(v)}$, $D^{(v)} \in \mathbb{R}^{k \times k}$ with $D_{ii}^{(v)} = \sum_{j=1}^{k} P_{ij}^{(v)}$, and $Q^{(v)} \in \mathbb{R}^{k \times k}$ with $Q_{ii}^{(v)} = \sum_{j=1}^{k} P_{ji}^{(v)}$.
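The matrix form can be checked with a small NumPy sketch (illustrative, with hypothetical helper names): it builds $P^{(v)}$, $D^{(v)}$ and $Q^{(v)}$ from the clusterwise weights and evaluates the trace term of Eq. (4) for one view.

```python
import numpy as np

def clusterwise_weight_term(U, M_diag):
    """Clusterwise weight term of Eq. (3)/(4) for one view.

    U      : (d_v, k) basis matrix; column i is the centroid of cluster i.
    M_diag : (k,) clusterwise weights, the diagonal of M^(v).
    Returns tr(U (D + Q - 2P) U^T), which equals
    sum_i (1 - M_ii) * sum_j ||U_.i - U_.j||^2.
    """
    k = U.shape[1]
    P = np.tile((1.0 - M_diag)[:, None], (1, k))  # P_ij = 1 - M_ii
    D = np.diag(P.sum(axis=1))                    # row sums of P
    Q = np.diag(P.sum(axis=0))                    # column sums of P
    return float(np.trace(U @ (D + Q - 2.0 * P) @ U.T))
```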

3.4. The Overall Objective Function

Combining the above three aspects, the overall objective function of the proposed method is Eq. (5),

$$\begin{aligned} & \sum_{v=1}^{m} \|X^{(v)} - U^{(v)} V^{(v)T}\|_F^2 + \alpha \sum_{v=1}^{m} \|V^{(v)} - V^* M^{(v)}\|_F^2 \\ & + \beta \sum_{v=1}^{m} \mathrm{tr}\left( U^{(v)} (D^{(v)} + Q^{(v)} - 2P^{(v)}) U^{(v)T} \right) + \gamma \sum_{v=1}^{m} \mathrm{tr}\left( M^{(v)} M^{(v)T} \right) \\ \mathrm{s.t.} \; & U^{(v)} \geq 0,\; V^{(v)} \geq 0,\; V^* \geq 0,\; M^{(v)} \geq 0,\; \sum_{v=1}^{m} M^{(v)} = I. \end{aligned} \tag{5}$$


The last term regularizes the weights of the views, which avoids the trivial solution in which only a small number of views contribute to the consensus result. $\alpha$, $\beta$ and $\gamma$ are non-negative hyper-parameters that control the trade-off among the aforementioned parts.

In Eq. (5), the first part adopts NMF as the basic single-view clustering to find the clusters within each view, which involves learning the basis vectors $U^{(v)}$ and the coefficient vectors $V^{(v)}$. The second part learns the consensus result, which involves learning the coefficient vectors $V^{(v)}$, the consensus coefficient vectors $V^*$ and the weights $M^{(v)}$. The third part learns the exact weight of each cluster, which involves learning the basis vectors $U^{(v)}$ and the weights $M^{(v)}$.
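Assuming the same per-view shapes as above, the following sketch evaluates Eq. (5) for given factors, which is useful, for example, for monitoring convergence; it is an illustration under our own naming conventions, not the authors' implementation.

```python
import numpy as np

def mcw_objective(X_list, U_list, V_list, V_star, M_list, alpha, beta, gamma):
    """Evaluate the overall MCW objective of Eq. (5).

    X_list : per-view (d_v, n) data; U_list : per-view (d_v, k) bases;
    V_list : per-view (n, k) coefficients; V_star : (n, k) consensus;
    M_list : per-view (k,) clusterwise weights (diagonals of M^(v)).
    """
    obj = 0.0
    for X, U, V, M in zip(X_list, U_list, V_list, M_list):
        k = U.shape[1]
        P = np.tile((1.0 - M)[:, None], (1, k))               # P_ij = 1 - M_ii
        D, Q = np.diag(P.sum(axis=1)), np.diag(P.sum(axis=0))
        obj += np.linalg.norm(X - U @ V.T, "fro") ** 2             # intra-view NMF
        obj += alpha * np.linalg.norm(V - V_star * M, "fro") ** 2  # inter-view term
        obj += beta * np.trace(U @ (D + Q - 2.0 * P) @ U.T)        # clusterwise term, Eq. (4)
        obj += gamma * float(np.sum(M ** 2))                       # tr(M^(v) M^(v)T)
    return obj
```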

We combine the three parts by a linear combination, and we optimize the overall objective function by an iterative update procedure in the following subsection, i.e., we optimize one variable while fixing the others. We learn the clusterwise weights based on the basis vectors and coefficient vectors; in turn, the weights guide the learning of the basis vectors and coefficient vectors. The whole process makes the centroids of the clusters more separable.

3.5. Optimization

In this subsection, we present an algorithm that optimizes Eq. (5) by an iterative update procedure. Eq. (5) is optimized with respect to the variables $U^{(v)}$, $V^{(v)}$, $V^*$ and $M^{(v)}$.

Fixing $U^{(v)}$, $V^*$ and $M^{(v)}$, update $V^{(v)}$. By dropping the terms irrelevant to $V^{(v)}$, the optimization objective turns into minimizing $J(V^{(v)})$,

$$J(V^{(v)}) = \|X^{(v)} - U^{(v)} V^{(v)T}\|_F^2 + \alpha \|V^{(v)} - V^* M^{(v)}\|_F^2 \quad \mathrm{s.t.} \quad V^{(v)} \geq 0.$$

The partial derivative of $J(V^{(v)})$ with respect to $V^{(v)}$ is

$$\frac{\partial J(V^{(v)})}{\partial V^{(v)}} = -2X^{(v)T} U^{(v)} + 2V^{(v)} U^{(v)T} U^{(v)} + 2\alpha V^{(v)} - 2\alpha V^* M^{(v)}.$$

Similar to the optimization of NMF, we use the Karush-Kuhn-Tucker (KKT) optimality conditions to optimize the proposed NMF-based problem. For a comprehensive discussion of KKT, see Convex Optimization [25]. Using the KKT complementary condition for the non-negativity of $V^{(v)}$, we get

$$-X^{(v)T} U^{(v)} + V^{(v)} U^{(v)T} U^{(v)} + \alpha V^{(v)} - \alpha V^* M^{(v)} = 0.$$

Based on the above equation, the updating rule for $V^{(v)}$ is as follows,

$$V^{(v)} \Leftarrow V^{(v)} \odot \frac{X^{(v)T} U^{(v)} + \alpha V^* M^{(v)}}{V^{(v)} U^{(v)T} U^{(v)} + \alpha V^{(v)}}, \tag{6}$$

where $\odot$ and the fraction denote element-wise multiplication and division, respectively.

to V ∗ , the optimization objective turns to minimize J(V ∗ ). J(V ∗ ) =α

m X v=1

kV (v) − V ∗ M (v) k2F

s.t.V ∗ ≥ 0,

The partial derivation of J(V ∗ ) with respect to V ∗ is m

re-

∂J(V ∗ ) X (−2αV (v) M (v) + 2αV ∗ M (v) M (v)T ) = 0. = ∂V ∗ v=1 Solving it, we have an exact solution for V ∗ ,

(7)

lP

m m X X M (v) M (v)T )−1 ≥ 0. V (v) M (v) )( V∗ =( v=1

v=1

Fixing V (v) , V ∗ and M (v) , update U (v) . By dropping the terms irrelevant to U (v) , the optimization objective turns to minimize J(U (v) ). J(U (v) ) =

m X

kX (v) − U (v) V (v)T k2F

urn a

v=1 m X



v=1

tr(U (v) (D(v) + Q(v) − 2P (v) )U (v)T )

s.t.U (v) ≥ 0,

The partial derivation of J(U (v) ) with respect to U (v) is

Jo

∂J(U (v) ) = − 2X (v) V (v) + 2U (v) V (v)T V (v) ∂U (v)

Let

∂J(U (v) ) ∂U (v)

+ 2βU (v) (D(v) + Q(v) − 2P (v) ).

= 0, we get the following updating rule for U (v) , U (v) ⇐ U (v)

X (v) V (v) + 2βU (v) P (v) . U (v) V (v)T V (v) + βU (v) (D(v) + Q(v) ) 9

(8)

Journal Pre-proof

Fixing U (v) , V ∗ and V (v) , update M (v) . By dropping the terms irrelevant to M (v) , the optimization objective turns to minimize J(M (v) ). k X m X (v) (v) (v) ( (Fii (Mii )2 − Hii Mii )) i=1 v=1 m X

s.t.

M

pro of

J(M (v) ) =

(v)

= I, M

v=1 (v)

(v)

(9)

≥ 0,

where Fii = α(V ∗T V ∗ )ii + γ, Hii = 2α(V (v)T V ∗ )ii + β(k − 1)(U (v)T U (v) )ii . It is easy to see that minimizing Eq.(9) can be relaxed to minimizing k sub-problems Eq.(10),

(v)

J(Mii ) =

m X (v) (v) (v) (Fii (Mii )2 − Hii Mii )

re-

155

v=1

s.t.

m X v=1

(v)

(10)

(v)

Mii = 1, Mii ≥ 0, (1)

(2)

(m)

lP

Eq.(10) is a standard quadratic programming problem with respect to [Mii , Mii , . . . , Mii ] and can be solved by classical techniques, e.g. the tool quadprog in Matlab. Then, we optimize Eq.(9) by using quadratic programming k times. We give the framework of MCW in Algorithm 1. Algorithm 1 is very similar to the standard NMF and it is trivial to prove its convergence.

urn a

160

3.6. Complexity Analysis

In Algorithm 1, the most time-consuming components are updating V ∗ , U (v) , V (v) and M (v) . The complexity of updating V ∗ is O(mnk 2 ). The complexity

165

of updating U (v) and V (v) is O(nkd(v) ). The computational complexity of conPm structing Eq.(10) (including construct F and H) is O(nk 2 + v=1 (k 2 d(v) )). The computational complexity of solving Eq.(10) is O(m3 ) through quadratic pro-

Jo

gramming, but since the number of views m is a small number, solving Eq.(10) usually does not cost too much time. Then the overall computational complexity Pm of MCW is O( v=1 (nkd(v) )).

10

Journal Pre-proof

Algorithm 1 The MCW algorithm Input: X (v) , α, β

pro of

Output: U (v) , V (v) , V ∗ , M (v) and the clustering results 1: Initialize U (v) , V (v) , V ∗ , M (v) . 2: while Eq.(5) is not converged do

for v = 1 to m do

3: 4:

Fixing U (v) , V ∗ and M (v) , update V (v) by Eq.(6).

5:

Fixing M (v) , V ∗ and V (v) , update U (v) by Eq.(8).

6:

end for

7:

Fixing U (v) , V (v) and M (v) , update V ∗ by Eq.(7).

8:

for i = 1 to k do

(1)

re-

end for

10:

11: end while

4. Experiment 4.1. Datasets

lP

170

(m)

Fixing U (v) , V ∗ and V (v) , update [Mii , . . . , Mii ] by minimizing Eq.(10).

9:

We perform experiments on several benchmark datasets which have been widely used in multi-view learning. Citeseer1 is comprised of 3312 documents over 6 labels, and described by 2 views (content and citations). Cora1 is comprised of 2708 documents over 7 labels. Every document is represented by 2

urn a

175

views (content and citations). Flickr [26]contains 1028 images represented in two views: image-tag view and image-user view. The images are divided into eight categories. Webkb2 is composed of web pages collected from four universities: Cornell, Texas, Washington and Wisconsin. The web pages are classified 180

into 7 categories. Here, we choose four most populous categories(course, faculty, project, student) for clustering. A web page is made of three views: the

Jo

text on it, the anchor text on the hyperlinks pointing to it and the text in its 1

http://membres-liglab.imag.fr/grimal/data.html

2 http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

11

Journal Pre-proof

title. ALOI3 is a collection of 110250 images of 1000 small objects. We select 100 classes with RGB color histograms and HSV/HSB color histograms as two views.

pro of

185

The statistics of the datasets are summarized in Table 1. Table 1: Statistics of the datasets.

# instance

# view

# cluster

Citeseer

3312

2

6

Cora

2708

2

7

Flickr

1028

2

8

Cornell

226

3

4

Texas

252

3

4

255

3

4

307

3

4

11025

2

100

Washington Wisconsin

lP

ALOI

re-

Dataset

4.2. Compared Methods

We compare the proposed method MCW with the following methods. (1)

190

urn a

Single view clustering: NMF [12]. We run NMF and report the best clustering result on all the views. (2) NMF on data with concatenated features of all the views: ConcatNMF. (3) Unweighted multi-view clustering: MultiNMF [16], Coregspectral [13] and DAIMC [27]. (4) Weighted multi-view clustering: RMKMC [2], LMKC [9], MMNMF [28], MLAN [5], WMSC [7]. All the kernel based baselines use Gaussian kernel to calculate the similarity matrix where the scale parameter is set as the median of the pairwise Euclidean distances between the data points [13]. For the other parameters of the base-

Jo

195

lines, we set them by grid search which is suggested in the original papers. 3 http://elki.dbs.ifi.lmu.de/wiki/DataSets/MultiView

12

Journal Pre-proof

Specifically, for WMSC, the parameters β and η are set by searching the grid {0.02, 0.04, . . . , 0.18, 0.2}. For MMNMF, the parameter β is set by searching the grid {6, 8, 10, 100, 1000}. For LMKC, the parameter λ is set by searching

pro of

200

the grid {2−15 , 2−13 , . . . , 215 }, and the parameter τ is set by searching the grid {0.05, 0.1, . . . , 0.95} ∗ n, where n is the number of samples. For RMKMC, it searches the logarithm of the parameter γ in the range from 0.1 to 2 with in-

cremental step 0.2. For DAIMC, the parameters α and β are set by searching 205

the grid {10−4 , 10−3 , 10−2 , 10−1 , 1, 10, 102 , 103 }. For Coregspectral, it searches the parameter λ in the range from 0.01 to 0.05 with incremental step 0.005. For MultiNMF, it searches the parameter λ in the range from 0.001 to 0.1.

re-

ConcatNMF and MLAN are parameter-free methods.

Two common evaluation metrics, accuracy (ACC) and normalized mutual 210

information (NMI) [29], are used to evaluate the performance of the proposed algorithm and the baselines. To avoid the randomness, we run all the algorithms

Texas

1

0.6

ACC

ACC

0.8

0.4

1/4

1/2

1 α

2

4

1 0.8

0.6

0.6

0.4

0 1/8

8

0.4 0.2

1/4

1/2

1 β

1/4

1/2

1 α

(d) α-NMI

2

4

8

0.4

4

0 1/8

8

1/4

(b) β-ACC

1/2

1 γ

2

4

8

2

4

8

(c) γ-ACC 1 0.8

0.6

0.6

NMI

1

0.8

0.4

0 1/8

Flickr

0.2 2

0.2

Jo

0 1/8

NMI

NMI

0.6

Cora

1

urn a

1

Citeseer

0.8

(a) α-ACC

0.8

Wisconsin

0.2

0.2 0 1/8

Washington

ACC

Cornell

lP

10 times and report their average values.
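For reference, a common way to compute these two metrics (our own sketch, not the authors' code) is NMI via scikit-learn and clustering accuracy via Hungarian matching between predicted clusters and ground-truth labels; both expect integer-coded labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """Clustering accuracy: best one-to-one mapping of predicted clusters to
    ground-truth classes found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_cls = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((n_cls, n_cls), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)   # maximize matched pairs
    return cost[rows, cols].sum() / y_true.size

# nmi = normalized_mutual_info_score(y_true, y_pred)
```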

0.4 0.2

1/4

1/2

1 β

2

4

8

(e) β-NMI

Figure 2: Influence of the parameters.

13

0 1/8

1/4

1/2

1 γ

(f) γ-NMI

Journal Pre-proof

4.3. Parameter Investigation In Figure 2, we explore the effects of the parameters α, β and γ on the clustering performance of MCW. For seven datasets, we repeat MCW 10 times

pro of

215

by carrying out the grid search on one parameter while keeping other parameters fixed. More specifically, we successively set one parameter to search the grid {2−3 , 2−2 , 2−1 , 20 , 21 , 22 , 23 } by fixing α = 1, β = 1, γ = 1. From Figure 2, it can be seen that MCW is more sensitive to the inter-view relationship 220

learning parameter α and fine-grained weight learning parameter β than the weight regularization parameter γ. Then, we set α and β to search the grid {2−3 , 2−2 , 2−1 , 20 , 21 , 22 , 23 } and γ = 1 in the following experiments.

re-

4.4. Results

The clustering results in terms of ACC and NMI are reported in Table 2 225

∼ Table 3. In each column of the two tables, the best result is highlighted in boldface. From the results, the following points can be observed.

lP

Firstly, we compare MCW with the single view method NMF as well as ConcatNMF. MCW performs better than NMF, since MCW exploits the knowledge from multiple view, whereas the single view NMF only utilizes the knowledge 230

within each view. The results indicate that it is reasonable to ensemble multiple

urn a

views. Nevertheless, ConcatNMF performs worse than MCW, which shows that concatenating the views does not help much. Secondly, we compare MCW with the unweighted multi-view clustering algorithms MultiNMF, Coregspectral and DAIMC. MCW outperforms MultiNMF 235

in terms of both ACC and NMI on each dataset. Taking ACC for example, the ACC of MCW raises 4% on the Citeseer dataset, 7% on the Cora dataset, 8% on the Flickr dataset, 15% on the Cornell dataset, 18% on the Texas dataset, 9% on the Washington dataset, 5% on the Wisconsin dataset, 1% on the ALOI dataset.

Jo

MCW performs better than Coregspectral and DAIMC in most cases. As ex-

240

ceptional cases, MCW performs worse than Coregspectral on Citeseer, Cornell and Wisconsin datasets in terms of NMI, and performs worse than DAIMC in terms of both ACC and NMI on the Cornell dataset. However, the NMI 14

pro of

Journal Pre-proof

Table 2: ACC on all the datasets.

MultiNMF Coregspectral DAIMC RMKMC LMKC

Cornell

Texas

Washington

Wisconsin

ALOI

0.375

0.336

0.439

0.643

0.695

0.776

0.608

0.309

±0.067

±0.010

±0.002

±0.052

±0.034

±0.106

±0.075

±0.011

0.385

0.364

0.484

0.532

0.556

±0.070

±0.004

0.429

0.343

±0.011

±0.050

0.463

0.362

±0.000

±0.008

0.421

0.372

±0.077

±0.006

0.283 ±0.000

0.630

0.593

0.428

±0.015

±0.049

±0.055

±0.062

±0.079

±0.082

0.421

0.644

0.582

0.796

0.698

0.503

±0.003

±0.070

±0.140

±0.083

±0.041

±0.023

0.451

0.706

0.679

0.867

0.720

0.521

±0.009

±0.066

±0.140

±0.000

±0.018

±0.022

0.343

0.809

0.713

0.856

0.728

0.537

±0.008

±0.012

±0.017

±0.027

±0.020

±0.016

0.329

0.329

0.531

0.548

0.502

0.684

0.509

±0.000

±0.000

±0.000

±0.000

±0.000

±0.000

±0.003

0.409

0.376

0.456

0.659

0.659

0.639

0.713

0.556

±0.037

±0.000

±0.000

±0.000

±0.000

±0.000

±0.000

±0.000

0.236

0.358

0.376

0.645

0.653

0.659

0.650

0.585

±0.003

±0.009

±0.011

±0.049

±0.182

±0.246

±0.037

±0.000

urn a

MMNMF

Flickr

MLAN

WMSC

0.373

0.357

0.416

0.805

0.657

0.816

0.681

0.518

±0.065

±0.000

±0.000

±0.000

±0.000

±0.000

±0.000

±0.066

0.436

0.382

0.462

0.805

0.761

0.830

0.752

0.580

±0.041

±0.006

±0.012

±0.000

±0.007

±0.013

±0.000

±0.000

0.477

0.412

0.502

0.795

0.766

0.896

0.757

0.605

±0.033

±0.019

±0.013

±0.026

±0.033

±0.013

±0.018

±0.000

Jo

MCW

re-

ConcatNMF

Cora

lP

NMF

Citeseer

15

pro of

Journal Pre-proof

Table 3: NMI on all the datasets.

MultiNMF Coregspectral DAIMC RMKMC LMKC

Cornell

Texas

Washington

Wisconsin

ALOI

0.135

0.206

0.432

0.421

0.322

0.686

0.319

0.494

±0.054

±0.000

±0.004

±0.035

±0.068

±0.064

±0.041

±0.025

0.141

0.180

0.462

0.401

0.319

±0.033

±0.021

0.198

0.173

±0.006

±0.014

0.237

0.228

±0.001

±0.011

0.194

0.222

±0.005

±0.003

0.131 ±0.000

0.580

0.400

0.659

±0.004

±0.025

±0.056

±0.059

±0.081

±0.698

0.369

0.415

0.327

0.658

0.396

0.717

±0.004

±0.097

±0.127

±0.136

±0.073

±0.011

0.436

0.553

0.528

0.679

0.559

0.778

±0.012

±0.005

±0.056

±0.000

±0.043

±0.021

0.320

0.560

0.319

0.676

0.392

0.776

±0.009

±0.015

±0.031

±0.034

±0.033

±0.035

0.195

0.321

0.378

0.396

0.517

0.488

0.738

±0.000

±0.000

±0.000

±0.000

±0.000

±0.000

±0.004

0.191

0.212

0.420

0.436

0.468

0.546

0.525

0.781

±0.000

±0.000

±0.000

±0.000

±0.000

±0.000

±0.000

±0.000

0.082

0.202

0.330

0.439

0.352

0.534

0.340

0.767

±0.006

±0.008

±0.007

±0.051

±0.173

±0.351

±0.054

±0.000

urn a

MMNMF

Flickr

MLAN

WMSC

0.159

0.219

0.402

0.449

0.307

0.601

0.350

0.708

±0.047

±0.000

±0.000

±0.000

±0.000

±0.000

±0.000

±0.022

0.195

0.201

0.481

0.504

0.484

0.676

0.534

0.761

±0.008

±0.006

±0.005

±0.000

±0.000

±0.037

±0.000

± 0.000

0.236

0.242

0.470

0.547

0.548

0.769

0.528

0.785

±0.033

±0.020

±0.004

±0.019

±0.047

±0.030

±0.024

±0.002

Jo

MCW

re-

ConcatNMF

Cora

lP

NMF

Citeseer

16

Journal Pre-proof

of Coregspectral only increased by 3%, the ACC of DAIMC only increased by 1.4% and the NMI of DAIMC only increased by 1.3%. Generally, it can be seen that MCW outperforms the unweighted multi-view clustering algorithms and it

pro of

245

can be concluded that it is rational to discriminate the views.

Finally, we compare MCW with the weighted multi-view clustering algorithms RMKMC, LMKC, MMNMF, MLAN and WMSC. MCW outperforms RMKMC, LMKC and MMNMF in terms of both ACC and NMI on each dataset. 250

Because RMKMC and MMNMF learn the weight by using algebraic method to minimize the objective function of all the views, however, the minimum objective function value does not indicate the optimal clustering result. MCW

re-

performs better than MLAN and WMSC in most cases. As exceptional cases, MCW performs worse than MLAN and WMSC on the Cornell datasets in terms 255

of ACC, and performs worse than WMSC on the Flickr dataset in terms of NMI. However, on the Cornell datasets, the ACC of MLAN and WMSC only increased by 1%; on the Flickr dataset, the NMI of WMSC only increased by 1.1%. Gen-

lP

erally, MCW outperforms the other weighted multi-view clustering algorithms, because MCW assigns each cluster a weight in each view, which can exploit the 260

positive impact of separable cluster and avoid negative impact of inseparable cluster.

urn a

In summary, it can be concluded that MCW performs the best among the competitive methods for clustering multi-view data. 4.5. Convergence Study 265

In Figure 3, we explore the convergence of MCW. We present the objective function values on each dataset with α = 1, β = 1, γ = 1. It can be seen that MCW will converge and its objective function value becomes stable after

Jo

around 50 iterations. 5. Conclusion

270

Considering the diversity of the clusters in each view, assigning the clusters

with different weights is important to multi-view clustering. In this paper, we 17

0

0.2 100

50

(d)

0

50 # Iteration

1.6 1.4

0

50 # Iteration

Obj NMI

0

50 # Iteration

0 100

Obj 0.8 NMI 0.6

0.2 1800

0

50 # Iteration

(f)

Wisconsin 0.6 Obj NMI

2500

0

50 # Iteration

(h)

(g)

Jo

Figure 3: Convergence study.

18

0 100

0.4

3000

2000

Obj 0.4 NMI 0.2

Texas 2600

0.4

0.2 100

NMI

0.5

urn a

2000

NMI

1

Obj Func Value

3000

0.6 NMI

Obj Func Value

NMI

1.8

(e)

Washington Obj Func Value

0.55 0.5 0.45 0.4 Obj 0.35 NMI 0.3 0.25 0.2 0.15 100

Obj Func Value

1700

lP

0.4

2500

Flickr

(c)

NMI

Obj NMI

200

NMI

0.6

400

1000

0 100

50 # Iteration

x 10

Cornell

0.8

Obj Func Value

Obj Func Value

Obj NMI 0.1

2

(b)

ALOI

0 0

0.2

7

6.6

4

0.3

6.8

(a)

600

x 10

7.2

0 100

NMI

50 # Iteration

0.25 0.2 Obj 0.15 NMI 0.1 0.05 0 100

re-

0

Cora

4

Citeseer

NMI

x 10

Obj Func Value

Obj Func Value

5

1.7 1.68 1.66 1.64 1.62 1.6

pro of

Journal Pre-proof

Journal Pre-proof

have proposed a multi-view clustering algorithm (MCW) via clusterwise weights learning. MCW conducts clustering on each view by using non-negative matrix

275

pro of

factorization, and learns the consensus clustering results by a weighted combination of each view. Specifically, we make the weight be proportional to the distance between the cluster and other clusters. We further present an effective alternating algorithm to solve the non-convex optimization problem. Experimental results show that the proposed algorithm outperforms typical single-view clustering algorithms, unweighted multi-view clustering algorithms and existing 280

weighted multi-view clustering algorithms. Therefore, it is effective for clustering multi-view data. In the future, we will further research on applying the

6. Acknowledgements

re-

clusterwise weights on clustering nonlinearly separable data.

This work was supported by National Science Foundation of China (No.61632019; No.61876028; No.61806034; No.61976037).

References References

lP

285

urn a

[1] G. Tzortzis, A. Likas, Kernel-based weighted multi-view clustering., in: 12th IEEE International Conference on Data Mining, 2012, pp. 675–684. 290

[2] X. Cai, F. Nie, H. Huang, Multi-view k-means clustering on big data, in: Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, AAAI Press, 2013, pp. 2598–2604. [3] D. Guo, J. Zhang, X. Liu, Y. Cui, C. Zhao, Multiple kernel learning based

Jo

multi-view spectral clustering, in: Pattern Recognition (ICPR), 2014 22nd 295

International Conference on, IEEE, 2014, pp. 3774–3779.

[4] H. Zhao, Z. Ding, Y. Fu, Multi-view clustering via deep matrix factorization., in: AAAI, 2017, pp. 2921–2927. 19

Journal Pre-proof

[5] F. Nie, G. Cai, X. Li, Multi-view clustering and semi-supervised classification with adaptive neighbours, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 2408–2414.

pro of

300

[6] F. Nie, J. Li, X. Li, Self-weighted multiview clustering with multiple graphs, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 2564–2570.

[7] L. Zong, X. Zhang, X. Liu, H. Yu, Weighted multi-view spectral clustering 305

based on spectral perturbation, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 4621–4628.

re-

[8] H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Affinity aggregation for spectral clustering, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 773–780. 310

[9] M. Li, X. Liu, L. Wang, Y. Dou, J. Yin, E. Zhu, Multiple kernel clustering

lP

with local kernel alignment maximization, in: Proceedings of the TwentyFifth International Joint Conference on Artificial Intelligence, 2016, pp. 4042–4046.

[10] X. Liu, Y. Dou, J. Yin, L. Wang, E. Zhu, Multiple kernel k -means clustering with matrix-induced regularization, in: Proceedings of the Thirtieth AAAI

urn a

315

Conference on Artificial Intelligence,, 2016, pp. 1888–1894. [11] L. Zong, X. Zhang, H. Yu, Q. Zhao, F. Ding, Local linear neighbor reconstruction for multi-view data, Pattern Recognition Letters 84 (2016) 56–62. 320

[12] D. D. Lee, H. S. Seung, Algorithms for non-negative matrix factorization,

Jo

in: Advances in neural information processing systems, 2001, pp. 556–562. [13] A. Kumar, P. Rai, H. Daume, Co-regularized multi-view spectral clustering, in: Advances in Neural Information Processing Systems, 2011, pp. 1413– 1421.

20

Journal Pre-proof

325

[14] C. Lu, S. Yan, Z. Lin, Convex sparse spectral clustering: Single-view to multi-view, IEEE Transactions on Image Processing 25 (6) (2016) 2833–

pro of

2843. [15] W. Cheng, X. Zhang, Z. Guo, Y. Wu, P. F. Sullivan, W. Wang, Flexible and robust co-regularized multi-domain graph clustering, in: Proceedings of the 330

19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2013, pp. 320–328.

[16] J. Liu, C. Wang, J. Gao, J. Han, Multi-view clustering via joint nonnegative matrix factorization, in: Proceedings of the 13th SIAM International

335

re-

Conference on Data Mining, Vol. 13, SIAM, 2013, pp. 252–260.

[17] S. Bickel, T. Scheffer, Multi-View Clustering, in: IEEE International Conference on Data Mining, 2004, pp. 19–26. doi:10.1109/ICDM.2004.10095. [18] K. Chaudhuri, S. M. Kakade, K. Livescu, K. Sridharan, Multi-view clus-

lP

tering via canonical correlation analysis, in: International Conference on Machine Learning, 2009, pp. 129–136. 340

[19] H. Gao, F. Nie, X. Li, H. Huang, Multi-view subspace clustering, in: The IEEE International Conference on Computer Vision (ICCV), 2015.

urn a

[20] X. Peng, Z. Yu, Z. Yi, H. Tang, Constructing the l2-graph for robust subspace learning and subspace clustering, IEEE transactions on cybernetics 47 (4) (2016) 1053–1066. 345

[21] X. Peng, J. Feng, S. Xiao, W.-Y. Yau, J. T. Zhou, S. Yang, Structured autoencoders for subspace clustering, IEEE Transactions on Image Processing 27 (10) (2018) 5076–5086.

Jo

[22] X. Peng, Z. Huang, J. Lv, H. Zhu, J. T. Zhou, Comic: Multi-view clustering without parameter selection, in: International Conference on Machine

350

Learning, 2019, pp. 5092–5101.

21

Journal Pre-proof

[23] Y. M. Xu, C. D. Wang, J. H. Lai, Weighted multi-view clustering with feature selection, Pattern Recognition 53 (2016) 25–35.

pro of

[24] Z. Huang, J. T. Zhou, X. Peng, C. Zhang, H. Zhu, J. Lv, Multi-view spectral clustering network, in: Proceedings of the Twenty-Eighth International 355

Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 2563–2569. [25] S. Boyd, L. Vandenberghe, Convex optimization, Cambridge university press, 2004.

[26] J. Liu, C. Wang, J. Gao, Q. Gu, C. Aggarwal, L. Kaplan, J. Han, Gin: A clustering model for capturing dual heterogeneity in networked data,

re-

360

in: Proceedings of 2015 SIAM International Conference on Data Minging, 2015.

[27] M. H. Hu, S. Chen, Doubly aligned incomplete multi-view clustering, in:

365

lP

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018, pp. 2262–2268. [28] L. Zong, X. Zhang, L. Zhao, H. Yu, Q. Zhao, Multi-view clustering via multi-manifold regularized non-negative matrix factorization, Neural Net-

urn a

works 88 (2017) 74–89.

[29] W. Xu, X. Liu, Y. Gong, Document clustering based on non-negative ma370

trix factorization, in: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval,

Jo

2003, pp. 267–273.

22

Journal Pre-proof

CRediT author statement

Qianli Zhao: Conceptualization, Methodology, Writing - Original Draft, Writing - Review & Editing.
Linlin Zong: Funding acquisition, Writing - Review & Editing, Writing - Original Draft, Formal analysis.
Xianchao Zhang: Supervision, Writing - Review & Editing, Resources.
Xinyue Liu: Project administration, Writing - Review & Editing, Resources.
Hong Yu: Writing - Review & Editing, Investigation, Resources.
