Journal Pre-proof Multi-view clustering via clusterwise weights learning Qianli Zhao, Linlin Zong, Xianchao Zhang, Xinyue Liu, Hong Yu
PII: DOI: Reference:
S0950-7051(19)30672-0 https://doi.org/10.1016/j.knosys.2019.105459 KNOSYS 105459
To appear in:
Knowledge-Based Systems
Received date : 28 May 2019 Revised date : 23 December 2019 Accepted date : 28 December 2019 Please cite this article as: Q. Zhao, L. Zong, X. Zhang et al., Multi-view clustering via clusterwise weights learning, Knowledge-Based Systems (2020), doi: https://doi.org/10.1016/j.knosys.2019.105459. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier B.V.
Journal Pre-proof
pro of
Multi-View Clustering via Clusterwise Weights Learning Qianli Zhao, Linlin Zong∗, Xianchao Zhang, Xinyue Liu, Hong Yu
Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian School of Software, Dalian University of Technology, Dalian, China, 116620
Abstract
It is known that the performance of multi-view clustering could be improved by
re-
assigning weights to the views, since different views play different roles in the final clustering results. Nevertheless, we observe that weights could be further refined, since in reality different clusters also have different impacts on finding the correct results. We propose a multi-view clustering algorithm with clusterwise weights (MCW), which assigns a weight on each cluster within each
lP
view. The objective function of MCW consists of three parts: (1) intra-view clustering: clustering each view by using non-negative matrix factorization; (2) inter-view relationship learning: learning the consensus clustering results by a weighted combination of each view; (3) clusterwise weight learning: learning
urn a
the weight of a cluster by making the weight be proportional to the average distance between the cluster and other clusters. We present an effective alternating algorithm to solve the non-convex optimization problem. Experimental results on several benchmark datasets demonstrate the superiority of the proposed algorithm over existing multi-view clustering methods. Keywords: Multi-view, Non-negative matrix factorization, Weight.
∗ Corresponding
Jo
author Email address:
[email protected] (Linlin Zong)
Preprint submitted to Journal of LATEX Templates
December 30, 2019
Journal Pre-proof
view 2
pro of
view 1
Figure 1: Two-view data with three clusters. The instances in view 1 and view 2 with the same color belong to the same latent cluster, the circle denotes the region where most of instances within each cluster are located.
re-
1. Introduction
Traditional clustering methods focus on datasets with a single view. However, one instance can be observed from different views. For example, A web page can be represented with the full text in the page. It can also be represented with the anchor text attached to hyperlink pointing to itself. By precisely
lP
5
describing the dataset, multiple views provide complementary information to facilitate learning tasks. Multi-view clustering, which classifies the similar samples into the same group and dissimilar samples into different groups by mining the multi-view features simultaneously, has attracted increasing attention in the past decade.
urn a
10
The early multi-view clustering algorithms treat all the views equally. However, it is more rational to assign the views with different weights [1]. Recently, several clustering algorithms have been proposed to discriminate different views. A series of algorithms[1] [2] [3] [4] use an algebraic method to minimize the ob15
jective function in all the views. There are also works[5] [6] that allocate weights for each view automatically without additional weights and penalty parameters.
Jo
Zong[7] proposed a weighted multi-view spectral clustering algorithm based on the spectral perturbation theory. Serveral algorithms[8] [9] [10] [11] aggregated multiple affinity matrices into a shared affinity matrix.
20
Note that previous studies on weighted multi-view clustering only assign one
2
Journal Pre-proof
weight to each view. However, in some applications, it is often the case that the distributions of the clusters vary from view to view. In other words, a cluster
pro of
may be easily detected in one view, but hard to be found in another view. For example, Figure 1 shows a typical two-view data with three clusters. In view 1, 25
it is easy to find the blue cluster, but it is difficult to separate the red one from the green cluster. Whereas, in view 2, the green cluster is easily separable, but the red and the blue ones are blended. In this case, if we only allocate one weight to each view while ignore the different impacts of different clusters, the negative information that ”the red and green points may be merged to the same cluster”
30
in view 1 may be transferred to view 2. The similar negative information also
re-
occurs in view 2 with the red and blue points. After the negative information interacts between the two views, all three clusters are hard to distinguish from each other consequently.
In this paper, we propose a multi-view clustering algorithm via clusterwise 35
weights learning (MCW), which assigns a weight to each cluster of each view.
lP
The overall objective function of MCW consists of three parts. (1) Intra-view clustering. In this part, we find the intra-view clusters and update the clusters by the knowledge transferred from other views. We adopt non-negative matrix factorization (NMF) [12] to find the intra-view clusters directly using the coefficient matrix. (2) Inter-view relationship learning. In this part, the
urn a
40
algorithm learns the consensus clustering result, or consensus partition, by a weighted combination of the information from each view. The foundation is that partitions of all views are close to the consensus partition, and each cluster in each view contributes to the consensus clusters separately. (3) Clusterwise 45
weight learning. In this part, we learn the weights of each cluster within each view. The intuition is that a cluster is separable if its centroid is far from the centroids of all other clusters. Thus, the weight is proportional to the average
Jo
distance between a cluster and other clusters. We further present an effective alternating algorithm to solve the non-convex optimization problem.
50
The contributions of this work are summarized as below:
3
Journal Pre-proof
• Even though multi-view clustering enforces the cluster centers as close as possible, existing multi-view clustering usually assigns a weight to a view
pro of
rather than to a cluster of each view. Our algorithm learns the weights of clusters in each view based on the basis vector. In turn, the weights guide 55
the learning of basis vectors and coefficient vectors. The whole process enforces the centroids of the clusters more separable.
• We explore a better multi-view clustering algorithm to partition multiview data utilizing the clusterwise weights. Experimental results show that the proposed algorithm outperforms typical unweighted multi-view clustering algorithms and weighted multi-view clustering algorithms.
re-
60
2. Related Work
Unweighted multi-view clustering.
Based on the basic algorithms
utilized by these methods, this kind of multi-view clustering can be roughly
65
lP
classified into three categories. The multi-view methods in the first category use spectral clustering as the basic algorithm. They learn the consensus similarity matrix or minimize the divergence of multiple views simultaneously [13][14]. The second category take advantage of NMF in their multi-view methods. They learn
urn a
the consensus indicator matrix or minimize the divergence of multiple indicator matrices [15] [16]. In the last category, they based on various methods[17] [18] 70
[19][20][21][22].
Weighted multi-view clustering.
The algorithm [1] learns a weighted
combination of the kernels in parallel. Under its framework, the algorithm in [3] determines the coefficients by maximizing the kernel alignment between the combined kernel and the indicator matrix approximated kernel. The algorithm [2] [4] use the same weighting scheme as the algorithm[1] to combine multiple non-negative matrix factorizations [12]. The work [23] respectively weights the
Jo
75
data points and feature representations in each view. The algorithms [5] [6] allocate weights for each view automatically without additional weights and penalty parameters. The algorithm [7] proposes a weighted multi-view spec4
Journal Pre-proof
80
tral clustering algorithm based on spectral perturbation theory. The work [24] incorporates the local invariance within each view and the consistency across
pro of
different views into a novel objective function, where the local invariance is defined by a deep metric learning network rather than the traditional Euclidean distance. The algorithms [8] [9] [10] [11] aggregate multiple affinity matrices 85
into a single one. These algorithms are under the framework of multi-kernel learning which is different from (though related with) multi-view clustering.
3. The Proposed Method 3.1. Intra-view Clustering
re-
Multi-view clustering extends single-view clustering which has been extensively studied. We adopt NMF to find the clusters within each view. Given a dataset with n instances which have representations in m views. the instances (v)
(v)
(v)
are to be partitioned into k clusters. Denote X (v) = [x1 , x2 , · · · , xn ] ∈
d R+
×n
as the data matrix in the v-th view, where d(v) is the dimension of
lP
(v)
the v-th view. In the v-th view, NMF tries to minimize the following objective function,
kX (v) − U (v) V (v)T k2F (v)
90
×k
U (v) ≥ 0, V (v) ≥ 0,
vector of the i-th cluster in the v-th view. V (v) ∈ (v)
whose j-th row Vj.
(1) (v)
is the basis matrix whose i-th column U.i
urn a
d where U (v) ∈ R+
s.t.
n×k R+
is the basis
is the coefficient matrix
is the coefficient vector of the j-th instance in the v-th (v)
view. Thanks to the non-nonnegativity of NMF, the instance xj
belongs to
the i-th cluster where i is the index of the maximum in the coefficient vector (v)
Vj. .
3.2. Inter-view Relationship Learning Inter-view relationship learning learns the consensus partition which contains
Jo
95
shared knowledge from multiple views. In multi-view clustering, all partitions of their view are close to a consensus partition. Thus, the partition in each
5
Journal Pre-proof
view contributes to the consensus partition. But as shown in Figure 1, as the 100
distribution of each view varies, the contributions of the clusters are different. (v)
as the contribution of the v-th view, Mii the v-th view. Since the j-th column
(v) V.j
pro of
Denote V ∗ as the consensus coefficient matrix, diagnal matrix M (v) ∈ Rk×k is the weight of the i-th cluster in
and the j-th column V.j∗ indicate the
possibility of each instance belonging to the j-th cluster, V ∗ can be a mixture 105
of V (v) , v ∈ {1, 2, . . . , m} with a certain proportion. Mathematically, we learn V ∗ by V (v) ≈ V ∗ M (v) , i.e., we minimize the following equation,
v=1
kV (v) − V ∗ M (v) k2F
∗
s.t.V ≥ 0, M
(v)
≥ 0,
M
(v)
(2)
= I.
v=1
For each cluster, without loss of generality, the weights are non-negative, and P the sum of its weights in each view is 1. From Eq.(2), we find that v V (v) ≈ P ∗ (v) = V ∗ , i.e., V ∗ is a weighted linear combination of V (v) . vV M
lP
110
m X
re-
m X
By minimizing the difference between the consensus coefficient matrix and
the coefficient matrix in each view, the learned consensus coefficient matrix contains the knowledge in multiple views. In turn the consensus coefficient
urn a
matrix guides the learning of the coefficient vectors in each view. In this way, the knowledge transfers from view to view. 115
3.3. Clusterwise Weight Learning This component learns the exact weight of each cluster based on the properties of the weights. In NMF based clustering, each cluster has a basis vector which can be seen as the centroid of the cluster. Intuitively, a cluster is separable if its centroid is far from the centroids of all other clusters. As a result, a
Jo
separable cluster is weighted more in the proposed algorithm. Mathematically, we minimize Eq.(3). m X k k X X (v) (v) (v) ( (1 − Mii )( (U.i − U.j )2 )) v=1 i=1
j=1
6
(3)
Journal Pre-proof
In Eq.(3),
Pk
(v) j=1 (U.i
(v)
− U.j )2 is the sum of the distance between the cen-
(v)
1 − Mii
120
pro of
troid of the i-th cluster and the centroids of other clusters in the v-th view. Pk (v) (v) If j=1 (U.i − U.j )2 is large, the cluster i is separable. To minimize Eq.(3), (v)
should be minimized, then, Mii
has a large value. Eq.(3) means (v)
that the larger the squared Euclidean distance between U.i
(v)
and U.j , j ∈
{1, 2, . . . , k} is, the higher the weight of cluster i is.
By expressing Eq.(3) in matrix form, the objective function of learning the fine-grained weight is m X v=1
(v)
(v)
(4)
(v)
where P (v) ∈ Rk×k and Pij = 1 − Mii , D(v) ∈ Rk×k and Dii = Pk (v) (v) Q(v) ∈ Rk×k and Qii = j=1 Pji .
re-
125
tr(U (v) (D(v) + Q(v) − 2P (v) )U (v)T
Pk
j=1
(v)
Pij ,
3.4. The Overall Objective Function
lP
Combining the above three aspects, the overall objective function of the proposed method is Eq.(5), m X
kX (v) − U (v) V (v)T k2F + α
v=1 m X v=1
+γ
m X
v=1
kV (v) − V ∗ M (v) k2F
tr(U (v) (D(v) + Q(v) − 2P (v) )U (v)T )
urn a
+β
m X
tr(M
(v)
M
(v)T
(5)
)
v=1
s.t.U (v) ≥ 0, V (v) ≥ 0, V ∗ ≥ 0, M (v) ≥ 0,
m X
M (v) = I,
v=1
The last term regularizes the weight of each view, which avoids the trivial solution that only a small number of views contribute to the consensus result. α, β and γ are non-negative hyper-parameters that controls the trade-off between
Jo
130
the aforementioned parts. In Eq.(5), the first part adopts NMF as the basic single-view clustering to
find the clusters within each view, which involves learning the basis vectors U (v) 7
Journal Pre-proof
135
and coefficient vectors V (v) . The second part learns the consensus result, which involves learning the coefficient vectors V (v) , the consensus coefficient vectors
pro of
V ∗ and the weight M (v) . The third part learns the exact weight of each cluster, which involves learning the basis vectors U (v) and the weight M (v) .
We combine all three parts by linear combination, then we optimize the 140
overall objective function by an iterative update procedure in the following subsection, i.e., we optimize one variable while fixing the other variables. We learn the clusterwise weights based on basis vectors and coefficient vectors. In turn, the weights guide the learning of basis vectors and coefficient vectors. The whole process enforces the centroids of the clusters more separable. 3.5. Optimization
re-
145
In this subsection, we present an algorithm to optimize Eq.(5) by an iterative update procedure. Optimizing Eq.(5) is with respect to variables U (v) , V (v) , V ∗ and M (v) .
lP
Fixing U (v) , V ∗ and M (v) , update V (v) . By dropping the terms irrelevant to V (v) , the optimization objective turns to minimize J(V (v) ). J(V (v) ) =kX (v) − U (v) V (v)T k2F + αkV (v) − V ∗ M (v) k2F s.t.V (v) ≥ 0.
urn a
The partial derivation of J(V (v) ) with respect to V (v) is ∂J(V (v) ) = − 2X (v)T U (v) + 2V (v) U (v)T U (v) ∂V (v) + 2αV (v) − 2αV ∗ M (v) .
Similar to the optimization of NMF, we also use the Karush−Kuhn−Tucker(KKT) 150
optimality conditions to optimize the proposed NMF based problem. For a comprehensive discussion of KKT, see Convex Optimization [25].
Jo
Using the KKT complementary condition for the non-negativity of V (v) , we
get
−X (v)T U (v) + V (v) U (v)T U (v) + αV (v) − αV ∗ M (v) = 0.
8
Journal Pre-proof
Based on the above equation, the updating rule for V (v) is as follows,
where and
· ·
X (v)T U (v) + αV ∗ M (v) , V (v) U (v)T U (v) + αV (v)
(6)
pro of
V (v) ⇐ V (v)
are element-wise multiplication and division.
Fixing U (v) , V (v) and M (v) , update V ∗ . By dropping the terms irrelevant
to V ∗ , the optimization objective turns to minimize J(V ∗ ). J(V ∗ ) =α
m X v=1
kV (v) − V ∗ M (v) k2F
s.t.V ∗ ≥ 0,
The partial derivation of J(V ∗ ) with respect to V ∗ is m
re-
∂J(V ∗ ) X (−2αV (v) M (v) + 2αV ∗ M (v) M (v)T ) = 0. = ∂V ∗ v=1 Solving it, we have an exact solution for V ∗ ,
(7)
lP
m m X X M (v) M (v)T )−1 ≥ 0. V (v) M (v) )( V∗ =( v=1
v=1
Fixing V (v) , V ∗ and M (v) , update U (v) . By dropping the terms irrelevant to U (v) , the optimization objective turns to minimize J(U (v) ). J(U (v) ) =
m X
kX (v) − U (v) V (v)T k2F
urn a
v=1 m X
+β
v=1
tr(U (v) (D(v) + Q(v) − 2P (v) )U (v)T )
s.t.U (v) ≥ 0,
The partial derivation of J(U (v) ) with respect to U (v) is
Jo
∂J(U (v) ) = − 2X (v) V (v) + 2U (v) V (v)T V (v) ∂U (v)
Let
∂J(U (v) ) ∂U (v)
+ 2βU (v) (D(v) + Q(v) − 2P (v) ).
= 0, we get the following updating rule for U (v) , U (v) ⇐ U (v)
X (v) V (v) + 2βU (v) P (v) . U (v) V (v)T V (v) + βU (v) (D(v) + Q(v) ) 9
(8)
Journal Pre-proof
Fixing U (v) , V ∗ and V (v) , update M (v) . By dropping the terms irrelevant to M (v) , the optimization objective turns to minimize J(M (v) ). k X m X (v) (v) (v) ( (Fii (Mii )2 − Hii Mii )) i=1 v=1 m X
s.t.
M
pro of
J(M (v) ) =
(v)
= I, M
v=1 (v)
(v)
(9)
≥ 0,
where Fii = α(V ∗T V ∗ )ii + γ, Hii = 2α(V (v)T V ∗ )ii + β(k − 1)(U (v)T U (v) )ii . It is easy to see that minimizing Eq.(9) can be relaxed to minimizing k sub-problems Eq.(10),
(v)
J(Mii ) =
m X (v) (v) (v) (Fii (Mii )2 − Hii Mii )
re-
155
v=1
s.t.
m X v=1
(v)
(10)
(v)
Mii = 1, Mii ≥ 0, (1)
(2)
(m)
lP
Eq.(10) is a standard quadratic programming problem with respect to [Mii , Mii , . . . , Mii ] and can be solved by classical techniques, e.g. the tool quadprog in Matlab. Then, we optimize Eq.(9) by using quadratic programming k times. We give the framework of MCW in Algorithm 1. Algorithm 1 is very similar to the standard NMF and it is trivial to prove its convergence.
urn a
160
3.6. Complexity Analysis
In Algorithm 1, the most time-consuming components are updating V ∗ , U (v) , V (v) and M (v) . The complexity of updating V ∗ is O(mnk 2 ). The complexity
165
of updating U (v) and V (v) is O(nkd(v) ). The computational complexity of conPm structing Eq.(10) (including construct F and H) is O(nk 2 + v=1 (k 2 d(v) )). The computational complexity of solving Eq.(10) is O(m3 ) through quadratic pro-
Jo
gramming, but since the number of views m is a small number, solving Eq.(10) usually does not cost too much time. Then the overall computational complexity Pm of MCW is O( v=1 (nkd(v) )).
10
Journal Pre-proof
Algorithm 1 The MCW algorithm Input: X (v) , α, β
pro of
Output: U (v) , V (v) , V ∗ , M (v) and the clustering results 1: Initialize U (v) , V (v) , V ∗ , M (v) . 2: while Eq.(5) is not converged do
for v = 1 to m do
3: 4:
Fixing U (v) , V ∗ and M (v) , update V (v) by Eq.(6).
5:
Fixing M (v) , V ∗ and V (v) , update U (v) by Eq.(8).
6:
end for
7:
Fixing U (v) , V (v) and M (v) , update V ∗ by Eq.(7).
8:
for i = 1 to k do
(1)
re-
end for
10:
11: end while
4. Experiment 4.1. Datasets
lP
170
(m)
Fixing U (v) , V ∗ and V (v) , update [Mii , . . . , Mii ] by minimizing Eq.(10).
9:
We perform experiments on several benchmark datasets which have been widely used in multi-view learning. Citeseer1 is comprised of 3312 documents over 6 labels, and described by 2 views (content and citations). Cora1 is comprised of 2708 documents over 7 labels. Every document is represented by 2
urn a
175
views (content and citations). Flickr [26]contains 1028 images represented in two views: image-tag view and image-user view. The images are divided into eight categories. Webkb2 is composed of web pages collected from four universities: Cornell, Texas, Washington and Wisconsin. The web pages are classified 180
into 7 categories. Here, we choose four most populous categories(course, faculty, project, student) for clustering. A web page is made of three views: the
Jo
text on it, the anchor text on the hyperlinks pointing to it and the text in its 1
http://membres-liglab.imag.fr/grimal/data.html
2 http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
11
Journal Pre-proof
title. ALOI3 is a collection of 110250 images of 1000 small objects. We select 100 classes with RGB color histograms and HSV/HSB color histograms as two views.
pro of
185
The statistics of the datasets are summarized in Table 1. Table 1: Statistics of the datasets.
# instance
# view
# cluster
Citeseer
3312
2
6
Cora
2708
2
7
Flickr
1028
2
8
Cornell
226
3
4
Texas
252
3
4
255
3
4
307
3
4
11025
2
100
Washington Wisconsin
lP
ALOI
re-
Dataset
4.2. Compared Methods
We compare the proposed method MCW with the following methods. (1)
190
urn a
Single view clustering: NMF [12]. We run NMF and report the best clustering result on all the views. (2) NMF on data with concatenated features of all the views: ConcatNMF. (3) Unweighted multi-view clustering: MultiNMF [16], Coregspectral [13] and DAIMC [27]. (4) Weighted multi-view clustering: RMKMC [2], LMKC [9], MMNMF [28], MLAN [5], WMSC [7]. All the kernel based baselines use Gaussian kernel to calculate the similarity matrix where the scale parameter is set as the median of the pairwise Euclidean distances between the data points [13]. For the other parameters of the base-
Jo
195
lines, we set them by grid search which is suggested in the original papers. 3 http://elki.dbs.ifi.lmu.de/wiki/DataSets/MultiView
12
Journal Pre-proof
Specifically, for WMSC, the parameters β and η are set by searching the grid {0.02, 0.04, . . . , 0.18, 0.2}. For MMNMF, the parameter β is set by searching the grid {6, 8, 10, 100, 1000}. For LMKC, the parameter λ is set by searching
pro of
200
the grid {2−15 , 2−13 , . . . , 215 }, and the parameter τ is set by searching the grid {0.05, 0.1, . . . , 0.95} ∗ n, where n is the number of samples. For RMKMC, it searches the logarithm of the parameter γ in the range from 0.1 to 2 with in-
cremental step 0.2. For DAIMC, the parameters α and β are set by searching 205
the grid {10−4 , 10−3 , 10−2 , 10−1 , 1, 10, 102 , 103 }. For Coregspectral, it searches the parameter λ in the range from 0.01 to 0.05 with incremental step 0.005. For MultiNMF, it searches the parameter λ in the range from 0.001 to 0.1.
re-
ConcatNMF and MLAN are parameter-free methods.
Two common evaluation metrics, accuracy (ACC) and normalized mutual 210
information (NMI) [29], are used to evaluate the performance of the proposed algorithm and the baselines. To avoid the randomness, we run all the algorithms
Texas
1
0.6
ACC
ACC
0.8
0.4
1/4
1/2
1 α
2
4
1 0.8
0.6
0.6
0.4
0 1/8
8
0.4 0.2
1/4
1/2
1 β
1/4
1/2
1 α
(d) α-NMI
2
4
8
0.4
4
0 1/8
8
1/4
(b) β-ACC
1/2
1 γ
2
4
8
2
4
8
(c) γ-ACC 1 0.8
0.6
0.6
NMI
1
0.8
0.4
0 1/8
Flickr
0.2 2
0.2
Jo
0 1/8
NMI
NMI
0.6
Cora
1
urn a
1
Citeseer
0.8
(a) α-ACC
0.8
Wisconsin
0.2
0.2 0 1/8
Washington
ACC
Cornell
lP
10 times and report their average values.
0.4 0.2
1/4
1/2
1 β
2
4
8
(e) β-NMI
Figure 2: Influence of the parameters.
13
0 1/8
1/4
1/2
1 γ
(f) γ-NMI
Journal Pre-proof
4.3. Parameter Investigation In Figure 2, we explore the effects of the parameters α, β and γ on the clustering performance of MCW. For seven datasets, we repeat MCW 10 times
pro of
215
by carrying out the grid search on one parameter while keeping other parameters fixed. More specifically, we successively set one parameter to search the grid {2−3 , 2−2 , 2−1 , 20 , 21 , 22 , 23 } by fixing α = 1, β = 1, γ = 1. From Figure 2, it can be seen that MCW is more sensitive to the inter-view relationship 220
learning parameter α and fine-grained weight learning parameter β than the weight regularization parameter γ. Then, we set α and β to search the grid {2−3 , 2−2 , 2−1 , 20 , 21 , 22 , 23 } and γ = 1 in the following experiments.
re-
4.4. Results
The clustering results in terms of ACC and NMI are reported in Table 2 225
∼ Table 3. In each column of the two tables, the best result is highlighted in boldface. From the results, the following points can be observed.
lP
Firstly, we compare MCW with the single view method NMF as well as ConcatNMF. MCW performs better than NMF, since MCW exploits the knowledge from multiple view, whereas the single view NMF only utilizes the knowledge 230
within each view. The results indicate that it is reasonable to ensemble multiple
urn a
views. Nevertheless, ConcatNMF performs worse than MCW, which shows that concatenating the views does not help much. Secondly, we compare MCW with the unweighted multi-view clustering algorithms MultiNMF, Coregspectral and DAIMC. MCW outperforms MultiNMF 235
in terms of both ACC and NMI on each dataset. Taking ACC for example, the ACC of MCW raises 4% on the Citeseer dataset, 7% on the Cora dataset, 8% on the Flickr dataset, 15% on the Cornell dataset, 18% on the Texas dataset, 9% on the Washington dataset, 5% on the Wisconsin dataset, 1% on the ALOI dataset.
Jo
MCW performs better than Coregspectral and DAIMC in most cases. As ex-
240
ceptional cases, MCW performs worse than Coregspectral on Citeseer, Cornell and Wisconsin datasets in terms of NMI, and performs worse than DAIMC in terms of both ACC and NMI on the Cornell dataset. However, the NMI 14
pro of
Journal Pre-proof
Table 2: ACC on all the datasets.
MultiNMF Coregspectral DAIMC RMKMC LMKC
Cornell
Texas
Washington
Wisconsin
ALOI
0.375
0.336
0.439
0.643
0.695
0.776
0.608
0.309
±0.067
±0.010
±0.002
±0.052
±0.034
±0.106
±0.075
±0.011
0.385
0.364
0.484
0.532
0.556
±0.070
±0.004
0.429
0.343
±0.011
±0.050
0.463
0.362
±0.000
±0.008
0.421
0.372
±0.077
±0.006
0.283 ±0.000
0.630
0.593
0.428
±0.015
±0.049
±0.055
±0.062
±0.079
±0.082
0.421
0.644
0.582
0.796
0.698
0.503
±0.003
±0.070
±0.140
±0.083
±0.041
±0.023
0.451
0.706
0.679
0.867
0.720
0.521
±0.009
±0.066
±0.140
±0.000
±0.018
±0.022
0.343
0.809
0.713
0.856
0.728
0.537
±0.008
±0.012
±0.017
±0.027
±0.020
±0.016
0.329
0.329
0.531
0.548
0.502
0.684
0.509
±0.000
±0.000
±0.000
±0.000
±0.000
±0.000
±0.003
0.409
0.376
0.456
0.659
0.659
0.639
0.713
0.556
±0.037
±0.000
±0.000
±0.000
±0.000
±0.000
±0.000
±0.000
0.236
0.358
0.376
0.645
0.653
0.659
0.650
0.585
±0.003
±0.009
±0.011
±0.049
±0.182
±0.246
±0.037
±0.000
urn a
MMNMF
Flickr
MLAN
WMSC
0.373
0.357
0.416
0.805
0.657
0.816
0.681
0.518
±0.065
±0.000
±0.000
±0.000
±0.000
±0.000
±0.000
±0.066
0.436
0.382
0.462
0.805
0.761
0.830
0.752
0.580
±0.041
±0.006
±0.012
±0.000
±0.007
±0.013
±0.000
±0.000
0.477
0.412
0.502
0.795
0.766
0.896
0.757
0.605
±0.033
±0.019
±0.013
±0.026
±0.033
±0.013
±0.018
±0.000
Jo
MCW
re-
ConcatNMF
Cora
lP
NMF
Citeseer
15
pro of
Journal Pre-proof
Table 3: NMI on all the datasets.
MultiNMF Coregspectral DAIMC RMKMC LMKC
Cornell
Texas
Washington
Wisconsin
ALOI
0.135
0.206
0.432
0.421
0.322
0.686
0.319
0.494
±0.054
±0.000
±0.004
±0.035
±0.068
±0.064
±0.041
±0.025
0.141
0.180
0.462
0.401
0.319
±0.033
±0.021
0.198
0.173
±0.006
±0.014
0.237
0.228
±0.001
±0.011
0.194
0.222
±0.005
±0.003
0.131 ±0.000
0.580
0.400
0.659
±0.004
±0.025
±0.056
±0.059
±0.081
±0.698
0.369
0.415
0.327
0.658
0.396
0.717
±0.004
±0.097
±0.127
±0.136
±0.073
±0.011
0.436
0.553
0.528
0.679
0.559
0.778
±0.012
±0.005
±0.056
±0.000
±0.043
±0.021
0.320
0.560
0.319
0.676
0.392
0.776
±0.009
±0.015
±0.031
±0.034
±0.033
±0.035
0.195
0.321
0.378
0.396
0.517
0.488
0.738
±0.000
±0.000
±0.000
±0.000
±0.000
±0.000
±0.004
0.191
0.212
0.420
0.436
0.468
0.546
0.525
0.781
±0.000
±0.000
±0.000
±0.000
±0.000
±0.000
±0.000
±0.000
0.082
0.202
0.330
0.439
0.352
0.534
0.340
0.767
±0.006
±0.008
±0.007
±0.051
±0.173
±0.351
±0.054
±0.000
urn a
MMNMF
Flickr
MLAN
WMSC
0.159
0.219
0.402
0.449
0.307
0.601
0.350
0.708
±0.047
±0.000
±0.000
±0.000
±0.000
±0.000
±0.000
±0.022
0.195
0.201
0.481
0.504
0.484
0.676
0.534
0.761
±0.008
±0.006
±0.005
±0.000
±0.000
±0.037
±0.000
± 0.000
0.236
0.242
0.470
0.547
0.548
0.769
0.528
0.785
±0.033
±0.020
±0.004
±0.019
±0.047
±0.030
±0.024
±0.002
Jo
MCW
re-
ConcatNMF
Cora
lP
NMF
Citeseer
16
Journal Pre-proof
of Coregspectral only increased by 3%, the ACC of DAIMC only increased by 1.4% and the NMI of DAIMC only increased by 1.3%. Generally, it can be seen that MCW outperforms the unweighted multi-view clustering algorithms and it
pro of
245
can be concluded that it is rational to discriminate the views.
Finally, we compare MCW with the weighted multi-view clustering algorithms RMKMC, LMKC, MMNMF, MLAN and WMSC. MCW outperforms RMKMC, LMKC and MMNMF in terms of both ACC and NMI on each dataset. 250
Because RMKMC and MMNMF learn the weight by using algebraic method to minimize the objective function of all the views, however, the minimum objective function value does not indicate the optimal clustering result. MCW
re-
performs better than MLAN and WMSC in most cases. As exceptional cases, MCW performs worse than MLAN and WMSC on the Cornell datasets in terms 255
of ACC, and performs worse than WMSC on the Flickr dataset in terms of NMI. However, on the Cornell datasets, the ACC of MLAN and WMSC only increased by 1%; on the Flickr dataset, the NMI of WMSC only increased by 1.1%. Gen-
lP
erally, MCW outperforms the other weighted multi-view clustering algorithms, because MCW assigns each cluster a weight in each view, which can exploit the 260
positive impact of separable cluster and avoid negative impact of inseparable cluster.
urn a
In summary, it can be concluded that MCW performs the best among the competitive methods for clustering multi-view data. 4.5. Convergence Study 265
In Figure 3, we explore the convergence of MCW. We present the objective function values on each dataset with α = 1, β = 1, γ = 1. It can be seen that MCW will converge and its objective function value becomes stable after
Jo
around 50 iterations. 5. Conclusion
270
Considering the diversity of the clusters in each view, assigning the clusters
with different weights is important to multi-view clustering. In this paper, we 17
0
0.2 100
50
(d)
0
50 # Iteration
1.6 1.4
0
50 # Iteration
Obj NMI
0
50 # Iteration
0 100
Obj 0.8 NMI 0.6
0.2 1800
0
50 # Iteration
(f)
Wisconsin 0.6 Obj NMI
2500
0
50 # Iteration
(h)
(g)
Jo
Figure 3: Convergence study.
18
0 100
0.4
3000
2000
Obj 0.4 NMI 0.2
Texas 2600
0.4
0.2 100
NMI
0.5
urn a
2000
NMI
1
Obj Func Value
3000
0.6 NMI
Obj Func Value
NMI
1.8
(e)
Washington Obj Func Value
0.55 0.5 0.45 0.4 Obj 0.35 NMI 0.3 0.25 0.2 0.15 100
Obj Func Value
1700
lP
0.4
2500
Flickr
(c)
NMI
Obj NMI
200
NMI
0.6
400
1000
0 100
50 # Iteration
x 10
Cornell
0.8
Obj Func Value
Obj Func Value
Obj NMI 0.1
2
(b)
ALOI
0 0
0.2
7
6.6
4
0.3
6.8
(a)
600
x 10
7.2
0 100
NMI
50 # Iteration
0.25 0.2 Obj 0.15 NMI 0.1 0.05 0 100
re-
0
Cora
4
Citeseer
NMI
x 10
Obj Func Value
Obj Func Value
5
1.7 1.68 1.66 1.64 1.62 1.6
pro of
Journal Pre-proof
Journal Pre-proof
have proposed a multi-view clustering algorithm (MCW) via clusterwise weights learning. MCW conducts clustering on each view by using non-negative matrix
275
pro of
factorization, and learns the consensus clustering results by a weighted combination of each view. Specifically, we make the weight be proportional to the distance between the cluster and other clusters. We further present an effective alternating algorithm to solve the non-convex optimization problem. Experimental results show that the proposed algorithm outperforms typical single-view clustering algorithms, unweighted multi-view clustering algorithms and existing 280
weighted multi-view clustering algorithms. Therefore, it is effective for clustering multi-view data. In the future, we will further research on applying the
6. Acknowledgements
re-
clusterwise weights on clustering nonlinearly separable data.
This work was supported by National Science Foundation of China (No.61632019; No.61876028; No.61806034; No.61976037).
References References
lP
285
urn a
[1] G. Tzortzis, A. Likas, Kernel-based weighted multi-view clustering., in: 12th IEEE International Conference on Data Mining, 2012, pp. 675–684. 290
[2] X. Cai, F. Nie, H. Huang, Multi-view k-means clustering on big data, in: Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, AAAI Press, 2013, pp. 2598–2604. [3] D. Guo, J. Zhang, X. Liu, Y. Cui, C. Zhao, Multiple kernel learning based
Jo
multi-view spectral clustering, in: Pattern Recognition (ICPR), 2014 22nd 295
International Conference on, IEEE, 2014, pp. 3774–3779.
[4] H. Zhao, Z. Ding, Y. Fu, Multi-view clustering via deep matrix factorization., in: AAAI, 2017, pp. 2921–2927. 19
Journal Pre-proof
[5] F. Nie, G. Cai, X. Li, Multi-view clustering and semi-supervised classification with adaptive neighbours, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 2408–2414.
pro of
300
[6] F. Nie, J. Li, X. Li, Self-weighted multiview clustering with multiple graphs, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 2564–2570.
[7] L. Zong, X. Zhang, X. Liu, H. Yu, Weighted multi-view spectral clustering 305
based on spectral perturbation, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 4621–4628.
re-
[8] H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Affinity aggregation for spectral clustering, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 773–780. 310
[9] M. Li, X. Liu, L. Wang, Y. Dou, J. Yin, E. Zhu, Multiple kernel clustering
lP
with local kernel alignment maximization, in: Proceedings of the TwentyFifth International Joint Conference on Artificial Intelligence, 2016, pp. 4042–4046.
[10] X. Liu, Y. Dou, J. Yin, L. Wang, E. Zhu, Multiple kernel k -means clustering with matrix-induced regularization, in: Proceedings of the Thirtieth AAAI
urn a
315
Conference on Artificial Intelligence,, 2016, pp. 1888–1894. [11] L. Zong, X. Zhang, H. Yu, Q. Zhao, F. Ding, Local linear neighbor reconstruction for multi-view data, Pattern Recognition Letters 84 (2016) 56–62. 320
[12] D. D. Lee, H. S. Seung, Algorithms for non-negative matrix factorization,
Jo
in: Advances in neural information processing systems, 2001, pp. 556–562. [13] A. Kumar, P. Rai, H. Daume, Co-regularized multi-view spectral clustering, in: Advances in Neural Information Processing Systems, 2011, pp. 1413– 1421.
20
Journal Pre-proof
325
[14] C. Lu, S. Yan, Z. Lin, Convex sparse spectral clustering: Single-view to multi-view, IEEE Transactions on Image Processing 25 (6) (2016) 2833–
pro of
2843. [15] W. Cheng, X. Zhang, Z. Guo, Y. Wu, P. F. Sullivan, W. Wang, Flexible and robust co-regularized multi-domain graph clustering, in: Proceedings of the 330
19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2013, pp. 320–328.
[16] J. Liu, C. Wang, J. Gao, J. Han, Multi-view clustering via joint nonnegative matrix factorization, in: Proceedings of the 13th SIAM International
335
re-
Conference on Data Mining, Vol. 13, SIAM, 2013, pp. 252–260.
[17] S. Bickel, T. Scheffer, Multi-View Clustering, in: IEEE International Conference on Data Mining, 2004, pp. 19–26. doi:10.1109/ICDM.2004.10095. [18] K. Chaudhuri, S. M. Kakade, K. Livescu, K. Sridharan, Multi-view clus-
lP
tering via canonical correlation analysis, in: International Conference on Machine Learning, 2009, pp. 129–136. 340
[19] H. Gao, F. Nie, X. Li, H. Huang, Multi-view subspace clustering, in: The IEEE International Conference on Computer Vision (ICCV), 2015.
urn a
[20] X. Peng, Z. Yu, Z. Yi, H. Tang, Constructing the l2-graph for robust subspace learning and subspace clustering, IEEE transactions on cybernetics 47 (4) (2016) 1053–1066. 345
[21] X. Peng, J. Feng, S. Xiao, W.-Y. Yau, J. T. Zhou, S. Yang, Structured autoencoders for subspace clustering, IEEE Transactions on Image Processing 27 (10) (2018) 5076–5086.
Jo
[22] X. Peng, Z. Huang, J. Lv, H. Zhu, J. T. Zhou, Comic: Multi-view clustering without parameter selection, in: International Conference on Machine
350
Learning, 2019, pp. 5092–5101.
21
Journal Pre-proof
[23] Y. M. Xu, C. D. Wang, J. H. Lai, Weighted multi-view clustering with feature selection, Pattern Recognition 53 (2016) 25–35.
pro of
[24] Z. Huang, J. T. Zhou, X. Peng, C. Zhang, H. Zhu, J. Lv, Multi-view spectral clustering network, in: Proceedings of the Twenty-Eighth International 355
Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 2563–2569. [25] S. Boyd, L. Vandenberghe, Convex optimization, Cambridge university press, 2004.
[26] J. Liu, C. Wang, J. Gao, Q. Gu, C. Aggarwal, L. Kaplan, J. Han, Gin: A clustering model for capturing dual heterogeneity in networked data,
re-
360
in: Proceedings of 2015 SIAM International Conference on Data Minging, 2015.
[27] M. H. Hu, S. Chen, Doubly aligned incomplete multi-view clustering, in:
365
lP
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018, pp. 2262–2268. [28] L. Zong, X. Zhang, L. Zhao, H. Yu, Q. Zhao, Multi-view clustering via multi-manifold regularized non-negative matrix factorization, Neural Net-
urn a
works 88 (2017) 74–89.
[29] W. Xu, X. Liu, Y. Gong, Document clustering based on non-negative ma370
trix factorization, in: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval,
Jo
2003, pp. 267–273.
22
Journal Pre-proof
CrediT author statement
Jo
urn a
lP
re-
pro of
Qianli Zhao: Conceptualization, Methodology, Writing - Original Draft, Writing - Review & Editing, Linlin Zong: Funding acquisition, Writing - Review & Editing, Writing - Original Draft, Formal analysis Xianchao Zhang: Supervision, Writing - Review & Editing, Resources Xinyue Liu: Project administration, Writing - Review & Editing, Resources Hong Yu: Writing - Review & Editing, Investigation, Resources
Journal Pre-proof
Jo
urn a
lP
re-
pro of
*Conflict of Interest Form