
Deep Low-Rank Subspace Ensemble for Multi-View Clustering


Zhe Xue (a), Junping Du (a,*), Dawei Du (b), Siwei Lyu (b)

(a) Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, School of Computer Science, Beijing University of Posts and Telecommunications, 100876, Beijing, China
(b) Computer Science Department, University at Albany, State University of New York, Albany, NY 12222, USA

* Corresponding author (email: [email protected])

Information Sciences (2019), https://doi.org/10.1016/j.ins.2019.01.018


Abstract

Multi-view clustering aims to incorporate complementary information from different data views for more effective clustering. However, it is difficult to obtain the true categories of data because multi-view data often exhibit complex distributions and diversified latent attributes. In this paper, we propose a new multi-view clustering method that integrates deep matrix factorization, low-rank subspace learning, and multiple subspace ensemble in a unified framework, which we term the Deep Low-Rank Subspace Ensemble (DLRSE) method. DLRSE first employs deep matrix factorization to capture diverse and hierarchical structures in data and obtain robust multi-view, multi-layer data representations. Then, low-rank subspace representations are learned from the extracted factors to further reveal their correlations, from which a more explicit clustering structure can be obtained. We further develop a subspace ensemble learning method with structured sparsity regularization to aggregate the different subspaces into a consensus subspace, which better captures the intrinsic clustering structure across multiple views. Extensive experiments on several datasets demonstrate that the proposed method can effectively explore the diversified clustering structure inherent in data, exploit multi-view complementary information, and achieve considerable improvement in performance over existing methods.

Keywords: multi-view clustering, deep matrix factorization, low-rank subspace, ensemble learning

1. Introduction

Clustering is an important unsupervised learning problem prevalent in a wide range of real-world applications in data mining and computer vision [17, 44]. Clustering algorithms are predicated on the underlying data representation, which determines the ultimate performance of the algorithm on the given dataset. While traditional clustering algorithms are designed for datasets described by a single type of feature, recent developments in feature extraction make it possible to associate multiple types of features with each data point. These different feature types are commonly referred to as multiple views of the dataset. For instance, many different types of features can be extracted from images, such as SIFT [25], LBP [31], HOG [6], or, more recently, deep learning based image features [12]. The multiple views of the same dataset usually provide complementary information that, when combined, can improve the overall clustering performance over any single view. It is for this reason that recent years have witnessed a surge of research dedicated to multi-view clustering [19, 5, 46].

Multi-view clustering methods generally make use of the clustering structure of multiple views to derive the true clustering structure. Some methods directly combine multi-view features [40] or leverage graphs to fuse different views [19, 4] to achieve better clustering. Instead of directly performing clustering over multiple features, subspace learning methods [5, 9, 43] are capable of learning low-rank subspaces in which more reliable clustering results can be obtained. Recently, a series of new multi-view clustering methods use "deep" models to handle datasets with complex latent cluster structures [1, 42, 48]. For instance, DCCAE [42] combines canonical correlation analysis (CCA) and an auto-encoder for multi-view feature learning and achieves better clustering performance than CCA and kernel CCA. An alternative approach is presented in [48], which obtains abstract data representations and hierarchical structure using deep matrix factorization [38]. However, these models mainly use the last layer of the deep model and ignore the information in the intermediate layers, which is also useful for improving the overall clustering performance.

In this paper, we propose the Deep Low-Rank Subspace Ensemble (DLRSE) method for multi-view clustering, the general architecture of which is illustrated in Figure 1. By combining the power of deep matrix factorization, low-rank subspace learning and subspace ensemble learning, DLRSE is capable of extracting effective multi-layer low-rank subspaces from multi-view data and exploring the intrinsic clustering structure among them. While existing multi-view clustering methods exploit complementary information in different views, DLRSE also uses the complementary information gleaned from different layers of the deep model. This enables it to recover a more accurate and stable clustering structure. Specifically, we first extract latent factors across multiple views and multiple layers using deep matrix factorization, such that these latent factors can effectively capture diverse and hierarchical structures in data. Then, low-rank subspace representations are obtained from the extracted factors to further refine the correlation structures, from which a more explicit clustering structure can be obtained. The final clustering is achieved with subspace ensemble learning, which aggregates the different subspaces into a low-dimensional consensus subspace that better reveals the intrinsic clustering structure across multiple data views. In this step, we use structured sparsity regularization, such that reliable subspaces are recovered with higher importance while noisy and unreliable ones are discounted. Experiments conducted on several image datasets demonstrate the effectiveness of our method.

The main contributions of this work are summarized as follows:

• We integrate deep matrix factorization, low-rank subspace learning, and multiple subspace ensemble in a unified framework, so that the components learn complementary information from each other for robust multi-view clustering.

• Instead of using only the top layer of the deep model as in previous methods, we incorporate the intermediate layers to obtain comprehensive and diverse clustering structures.

• We introduce structured sparsity regularization to promote reliable subspaces and suppress unreliable ones from different views and layers, leading to a more efficient and accurate data representation.

The remainder of this paper is organized as follows. Related work on multi-view clustering is reviewed in Section 2. The methodology of the proposed approach is presented in Section 3. The optimization of our method is elaborated in Section 4. Experiments on four challenging benchmarks are presented in Section 5. Concluding remarks are given in Section 6.


Figure 1: The overall framework of DLRSE, with three major components: deep matrix factorization, low-rank subspace learning, and multi-view multi-layer ensemble learning. In multi-view deep matrix factorization, different colors denote different views. Each layer of the deep model contains a base matrix Z and a coefficient matrix H. In deep low-rank subspace learning, a low-rank representation S is learned from the H of each layer. Finally, a consensus subspace F is obtained by performing multi-view multi-layer ensemble learning, which is further used for clustering.


2. Related Works

In this section, we briefly discuss previous works that are closely related to our method. Based on how they integrate different views, these works can be classified into three categories: multiple kernel/graph learning based methods, subspace learning based methods, and deep learning based methods.

2.1. Multiple Kernel/Graph Learning based Methods

Multiple kernel/graph learning based methods (e.g., [19, 4, 14]) represent each view by a kernel or a similarity graph, and then fuse the different kernels/graphs into a unified one that optimally preserves the similarities of the data. The final results are usually obtained by clustering (using k-means or spectral clustering, etc.) on the unified kernel/graph. By assuming that different views will assign a sample to the same cluster, a co-training based spectral clustering method is proposed in [19]. The spectral embedding of one view is used to constrain the other view until the clusterings of the two views are consistent with each other. Similarly, multi-modal spectral clustering [4] learns a shared graph Laplacian matrix by unifying different views. To obtain a more robust solution, a non-negative constraint is imposed on the cluster indicator matrix. In addition, to improve the effectiveness and robustness of multi-view fusion, several methods [16, 27, 29] assign fusion weights to different views. Reliable views are assigned higher weights and unreliable ones are assigned lower weights, so that more accurate clustering results can be obtained. To learn the weight of each view automatically, auto-weighted multi-view clustering methods [27, 29] adjust the weights without introducing additional parameters. Since the original data may contain noise and outliers that lead to unreliable graphs, a multi-view learning method is proposed in [28] to simultaneously perform local structure learning and clustering, and the clustering results can be efficiently obtained by directly partitioning the learned optimal graph. Considering the sample-specific characteristics of data, some localized kernel fusion methods have been developed [11], which achieve better fusion results than global data fusion methods.

2.2. Subspace Learning based Methods

Subspace learning based methods [24, 13] aim to learn a unified latent subspace from multi-view data in which views with different physical meanings can be compared. Several methods [24, 13] learn the common subspace by Non-negative Matrix Factorization (NMF) [21]. By jointly factorizing multiple matrices towards a common consensus, compatible clustering results can be learned. A joint non-negative matrix factorization method is proposed in [24] which pushes the clustering solution of each view towards a consensus; to guarantee that the different views are comparable with each other, an effective normalization strategy is adopted. Co-regularized NMF is proposed in [13] to jointly factorize multi-view matrices for clustering, where two co-regularization paradigms are introduced to make different views consistent. In addition to matrix factorization, subspace clustering [39, 8] achieves promising performance for multi-view learning. It was first proposed to simultaneously cluster data and find a low-dimensional subspace fitting each cluster, and was then applied to multi-view clustering [5, 45, 9, 43]. DiMSC [5] learns diversified subspace representations to explore the complementarity of multi-view data. MVSC [9] performs subspace clustering for each view simultaneously, and a common indicator is used to obtain a common clustering structure. ECMSC [43] adopts exclusivity and consistency constraints for subspace learning to obtain a consistent clustering result from different views. However, these subspace clustering methods are based on shallow models, so it is hard for them to learn robust representations for complex data distributions.

2.3. Deep Learning based Methods

Recently, deep models have been widely leveraged for this task because of their good performance. Deep canonical correlation analysis (DCCA) [1] is a nonlinear extension of CCA, in which the parameters of both views of the deep model are jointly learned to maximize the total correlation. DCCAE [42] combines CCA and an auto-encoder for multi-view feature learning and achieves better clustering performance than CCA and kernel CCA. The multimodal deep Boltzmann machine [37] is developed to learn the probability density over the space of multimodal data; it can be used to learn a unified representation that fuses different views together. To obtain abstract data representations and hierarchical structure, deep matrix factorization is developed to factorize the data matrix into multiple factors [26, 38], and it has been extended to multi-view clustering [48]. Compared to DCCA or DCCAE, which can only handle data with two views, multi-view deep matrix factorization [48] is capable of clustering data with more than two views. Nevertheless, multi-view deep matrix factorization [48] mainly uses the top layer of the network and ignores the information contained in the intermediate layers. These layers contain rich attribute information and diversified data distribution structures, which can be exploited for better clustering performance.

3. Methodology

3.1. Preliminaries


Consider a data matrix X ∈ R^{d×n} that corresponds to a dataset of n samples, each with d features represented by a d-dimensional vector forming a column of X. Multi-view data with V views and n samples can then be denoted by a set of matrices {X^{(v)} ∈ R^{d_v×n}}_{v=1}^{V}, where d_v is the dimension of the v-th view. Our objective is to learn a low-dimensional consensus subspace F ∈ R^{c×n} to represent the multi-view data and obtain the final clustering results through F. Next, we introduce two closely related methods: Semi-Nonnegative Matrix Factorization (Semi-NMF) and subspace clustering.

The Semi-NMF method [7] factorizes X into two matrices

\min_{Z,H} \|X - ZH\|_F^2, \quad \text{s.t. } H \ge 0, \qquad (1)

where Z ∈ R^{d×k} is the loading matrix and H ∈ R^{k×n} is the coefficient matrix. Different from NMF [21], which restricts all the matrices to be non-negative, Semi-NMF allows X and Z to have mixed signs to achieve wider applicability. Since H contains strictly non-negative elements, it can be regarded as a cluster indicator matrix and provides an interpretable data representation.
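For concreteness, the Semi-NMF updates of [7] alternate a least-squares step for Z with a sign-split multiplicative step for H. The following NumPy sketch is illustrative only (initialization, iteration count and the small eps are implementation choices, not part of the paper):

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Minimal Semi-NMF sketch: X (d x n) ~ Z (d x k) @ H (k x n), with H >= 0."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    H = np.abs(rng.standard_normal((k, n)))          # non-negative initialization
    pos = lambda A: (np.abs(A) + A) / 2
    neg = lambda A: (np.abs(A) - A) / 2
    for _ in range(n_iter):
        # Z is unconstrained: least-squares solution given H
        Z = X @ H.T @ np.linalg.pinv(H @ H.T)
        # Multiplicative update keeps H non-negative
        ZtX, ZtZ = Z.T @ X, Z.T @ Z
        H *= np.sqrt((pos(ZtX) + neg(ZtZ) @ H) /
                     (neg(ZtX) + pos(ZtZ) @ H + eps))
    return Z, H
```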

Considering that the original dataset may be produced by complex data distributions with multiple attributes, such as pose, expression and identity in a face dataset, deep Semi-NMF [38] is proposed to learn the inherent attributes and higher-level feature representations by decomposing the data matrix X into m layers. The learning process is

X \approx Z_1 H_1^+,
X \approx Z_1 Z_2 H_2^+,
\;\;\vdots
X \approx Z_1 Z_2 \cdots Z_m H_m^+, \qquad (2)

where Z_i ∈ R^{k_{i-1}×k_i} and H_i ∈ R^{k_i×n} are the loading matrix and coefficient matrix of the i-th layer, respectively, and k_i is the dimension of the i-th layer. (a)^+ = max(0, a), which corresponds to the hinge operation.


Subspace clustering [39, 8] assumes that data are drawn from different subspaces, and its goal is to cluster data according to their underlying subspaces. The low-rank representation (LRR) method [22] learns a low-rank subspace representation and achieves promising clustering performance. Given a data matrix X ∈ R^{d×n}, LRR solves the self-representation problem by finding the lowest-rank representation of all data as

\min_{S} \|S\|_*, \quad \text{s.t. } X = XS + E, \qquad (3)

where S ∈ R^{n×n} is the learned low-rank subspace representation of the data X, E is the error matrix, and ||·||_* denotes the nuclear norm of a matrix, which equals the sum of its singular values. The learned subspace S is an effective data representation that captures the relationships between data samples and provides an explicit clustering structure.
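Although this step is not spelled out here, a common way to use such a self-representation S for clustering in LRR-based methods is to symmetrize it into an affinity matrix and apply spectral clustering; a minimal sketch, where the symmetrization choice is an assumption:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_representation(S, n_clusters, seed=0):
    """Turn a (possibly asymmetric) self-representation S into an affinity and cluster it."""
    W = (np.abs(S) + np.abs(S).T) / 2          # symmetric, non-negative affinity
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity='precomputed', random_state=seed)
    return model.fit_predict(W)
```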

3.2. Deep Low-Rank Subspace Learning

To reveal multiple hidden attributes and different clustering structures inherent in data, we perform deep Semi-NMF on each view and extract multi-layer representations as

O_d(Z_i^{(v)}, H_i^{(v)}) = \sum_{v=1}^{V} \|X^{(v)} - Z_1^{(v)} Z_2^{(v)} \cdots Z_m^{(v)} H_m^{(v)}\|_F^2,
\quad \text{s.t. } H_i^{(v)} \ge 0, \; i \in \{1, 2, \cdots, m\}, \; v \in \{1, 2, \cdots, V\}, \qquad (4)

where m is the number of layers, and Z_i^{(v)} and H_i^{(v)} are the learned loading matrix and coefficient matrix of the i-th layer for view v, respectively. By performing deep matrix factorization on each view, different attributes and clustering structures of the multi-view data can be learned in the coefficient matrices of each layer {H_i^{(v)}}_{i=1}^{m}. It should be noted that {H_i^{(v)}}_{i=1}^{m-1} are implicitly contained in O_d(Z_i^{(v)}, H_i^{(v)}), and from (2) we can observe that H_j^{(v)} ≈ Z_{j+1}^{(v)} Z_{j+2}^{(v)} \cdots Z_m^{(v)} H_m^{(v)}.

Although {H_i^{(v)}}_{i=1}^{m} are learned from different layers and views, they are not effective and robust enough to represent the relations between data. Inspired by the success of subspace clustering, we propose to further enhance the clustering structure by learning a low-rank subspace representation S_i^{(v)} ∈ R^{n×n} from H_i^{(v)}. LRR [22] is adopted for low-rank subspace learning:

O_l(S_i^{(v)}, H_i^{(v)}) = \sum_{v=1}^{V} \sum_{i=1}^{m} \left( \|H_i^{(v)} - H_i^{(v)} S_i^{(v)}\|_F^2 + \|S_i^{(v)}\|_* \right),
\quad \text{s.t. } S_i^{(v)} \ge 0, \; i \in \{1, 2, ..., m\}, \; v \in \{1, 2, ..., V\}, \qquad (5)

where S_i^{(v)} ∈ R^{n×n} is the subspace representation for the i-th layer of view v, and the non-negative constraint S_i^{(v)} ≥ 0 is used to constrain the solution. Through self-representation learning, S_i^{(v)} can be treated as a similarity graph: each element of S_i^{(v)} reflects the similarity between two samples, so S_i^{(v)} encodes the correlations among all samples. By adopting the nuclear norm regularization, S_i^{(v)} achieves a more explicit block-diagonal property [22], which can effectively enhance data correlations and reveal the low-rank clustering structures inherent in different layers and views.

3.3. Multi-View Multi-Layer Subspace Ensemble Learning

After obtaining the subspaces S_i^{(v)} from different views and layers, we propose to learn a low-dimensional consensus subspace F ∈ R^{c×n} which aggregates the clustering information from all the subspaces. We use the difference between F and the projection of S_i^{(v)} onto F to measure the loss of embedding, i.e., if ||F − F S_i^{(v)}||_F^2 is smaller, then F accommodates the clustering structure of view v more. The orthonormal constraint F F^T = I is imposed to avoid the trivial solution.

In addition, to better leverage the complementary nature of these subspaces, two factors should be considered. First, different subspaces differ in importance, so the importance of each subspace should be explicitly controlled. Second, although different views describe the same data, some views generate more accurate and reliable descriptions than others. Therefore, we should give different views corresponding importance weights so that they provide complementary information to each other for a better overall clustering result. To this end, we introduce a structured sparsity regularization for the multi-view multi-layer subspace ensemble as

O_e(F, S_i^{(v)}, \alpha) = \sum_{v=1}^{V} \sum_{i=1}^{m} \alpha_i^{(v)} \|F - F S_i^{(v)}\|_F^2 + \lambda_3 \|\alpha\|_G,
\quad \text{s.t. } \alpha \ge 0, \; 1^T \alpha = 1, \; F F^T = I, \qquad (6)

where α = [α_1^{(1)} α_2^{(1)} \cdots α_m^{(1)} \; α_1^{(2)} α_2^{(2)} \cdots α_m^{(2)} \; \cdots \; α_m^{(V)}]^T ∈ R^{mV×1} is the coefficient vector. ||·||_G is the group ℓ1-norm [41], defined as \|\alpha\|_G = \sum_{v=1}^{V} \|\alpha^{(v)}\|_2, with α^{(v)} = [α_1^{(v)} α_2^{(v)} ... α_m^{(v)}]^T ∈ R^{m×1}. Here we make the coefficients from view v belong to the same group α^{(v)}, such that the number of groups equals the number of views V. λ_3 is used for tuning the sparseness of the coefficient vector α.
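For concreteness, the group ℓ1-norm and the ensemble loss in (6) can be computed as in the following NumPy sketch (illustrative only; the ordering of alpha and the list of subspaces follow the notation above):

```python
import numpy as np

def group_l1_norm(alpha, m, V):
    """||alpha||_G: sum over views of the l2-norm of that view's m coefficients."""
    A = np.asarray(alpha).reshape(V, m)            # row v holds alpha^(v)
    return float(np.sum(np.linalg.norm(A, axis=1)))

def ensemble_loss(F, S_list, alpha, lam3, m, V):
    """O_e(F, S, alpha): weighted embedding losses plus the group-sparsity penalty.

    S_list contains the mV subspace matrices ordered like alpha, i.e.
    [S_1^(1), ..., S_m^(1), S_1^(2), ..., S_m^(V)].
    """
    loss = sum(a * np.linalg.norm(F - F @ S, 'fro') ** 2
               for a, S in zip(alpha, S_list))
    return loss + lam3 * group_l1_norm(alpha, m, V)
```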

The merits of the proposed subspace ensemble approach are twofold. First, we assume that if F S_i^{(v)} is more consistent with the consensus subspace F, then S_i^{(v)} is considered more reliable and α_i^{(v)} should be larger. This makes the estimation of F more accurate. Second, the importance of different views is controlled by the group ℓ1-norm. If the distances between the subspaces of the v-th view {S_i^{(v)}}_{i=1}^{m} and the consensus subspace F are larger, the coefficients in group α^{(v)} become lower due to the higher cost. The structured sparsity regularization thus makes the effective views more important and suppresses the unreliable ones, which further enhances the accuracy and robustness of the low-dimensional consensus subspace F.

3.4. Ultimate Objective Function

The overall objective function is formed by combining the above sub-problems. By jointly conducting deep matrix factorization, low-rank subspace learning and subspace ensemble learning, we propose to minimize the following objective function for multi-view clustering

O_u = O_d(Z_i^{(v)}, H_i^{(v)}) + O_l(S_i^{(v)}, H_i^{(v)}) + O_e(F, S_i^{(v)}, \alpha)
    = \underbrace{\sum_{v=1}^{V} \|X^{(v)} - Z_1^{(v)} Z_2^{(v)} \cdots Z_m^{(v)} H_m^{(v)}\|_F^2}_{\text{multi-view deep matrix factorization}}
    + \lambda_1 \underbrace{\sum_{v=1}^{V} \sum_{i=1}^{m} \left( \|H_i^{(v)} - H_i^{(v)} S_i^{(v)}\|_F^2 + \|S_i^{(v)}\|_* \right)}_{\text{deep low-rank subspace learning}}
    + \lambda_2 \underbrace{\sum_{v=1}^{V} \sum_{i=1}^{m} \alpha_i^{(v)} \|F - F S_i^{(v)}\|_F^2 + \lambda_3 \|\alpha\|_G}_{\text{multi-view multi-layer subspace ensemble learning}},
\quad \text{s.t. } H_i^{(v)} \ge 0, \; S_i^{(v)} \ge 0, \; \alpha \ge 0, \; 1^T \alpha = 1, \; F F^T = I, \qquad (7)

where we add two parameters λ_1 and λ_2 to control the weight of each sub-problem: λ_1 weights the low-rank subspace learning term, and λ_2 weights the subspace ensemble learning term. By solving the ultimate objective function O_u, our method learns F through the low-rank subspaces from different layers and views. Then, F is treated as the new data representation, and the multi-view clustering results are obtained by performing a clustering method on F. Spectral clustering is adopted in this work, and the number of clusters is set to the true number of categories of the data.
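For example, given the learned consensus subspace F (of size c × n), the final labels can be obtained by running off-the-shelf spectral clustering on its columns; the affinity construction below is an assumption for illustration, since the paper does not specify it:

```python
from sklearn.cluster import SpectralClustering

def cluster_consensus(F, n_clusters, seed=0):
    """Cluster the n samples using their consensus representations (columns of F)."""
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity='nearest_neighbors',
                               random_state=seed)
    return model.fit_predict(F.T)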


4. Optimization

We use an iterative block coordinate descent algorithm for the optimization of problem (7). In each iteration, only one variable is updated while the others are fixed. For initialization, we pre-train each layer and obtain initial Z_i^{(v)} and H_i^{(v)} similar to [38]. Then, all the variables Z_i^{(v)}, H_i^{(v)}, S_i^{(v)}, F and α_i^{(v)} are fine-tuned according to the update rules. Next, we present the details of the pre-training and the derivations of the update rules. The whole learning procedure for solving problem (7) is summarized in Algorithm 2.

4.1. Pre-training

We pre-train each view X^{(v)} (v ∈ {1, ..., V}) layer by layer. For each view, the first layer is learned by X^{(v)} ≈ Z_1^{(v)} H_1^{(v)}, where Z_1^{(v)} ∈ R^{d_v×k_1} and H_1^{(v)} ∈ R^{k_1×n}. After that, the coefficient matrix H_1^{(v)} is further decomposed by H_1^{(v)} ≈ Z_2^{(v)} H_2^{(v)}, where Z_2^{(v)} ∈ R^{k_1×k_2} and H_2^{(v)} ∈ R^{k_2×n}. This process is continued until all the layers are pre-trained, i.e., H_2^{(v)} ≈ Z_3^{(v)} H_3^{(v)}, ..., H_{m−1}^{(v)} ≈ Z_m^{(v)} H_m^{(v)}. Then, all the variables are fine-tuned by the following update rules.
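A minimal sketch of this layer-wise pre-training, reusing the semi_nmf routine sketched in Section 3.1 (illustrative only; the layer sizes and stopping criteria are free choices):

```python
def pretrain_view(X_v, layer_dims):
    """Greedy layer-wise pre-training: X_v ~ Z_1 H_1, then H_1 ~ Z_2 H_2, and so on."""
    Z_list, H_list, H = [], [], X_v
    for k in layer_dims:            # e.g. layer_dims = (100, 50)
        Z, H = semi_nmf(H, k)       # factorize the previous coefficient matrix
        Z_list.append(Z)
        H_list.append(H)
    return Z_list, H_list           # initial Z_i^{(v)} and H_i^{(v)} for fine-tuning
```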


4.2. Update Rule for Z_i^{(v)}

To solve Z_i^{(v)}, we set the derivative ∂O_u/∂Z_i^{(v)} = 0, and the update rule can be obtained as

Z_i^{(v)} \leftarrow \Psi^{\dagger} X^{(v)} \tilde{H}_i^{(v)\dagger}, \qquad (8)

where Ψ = Z_1^{(v)} ··· Z_{i−1}^{(v)}, (·)^† denotes the Moore-Penrose pseudo-inverse operator with A^† = (A^T A)^{−1} A^T, and H̃_i^{(v)} is the reconstruction of the i-th layer's latent factor in the v-th view.
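A NumPy sketch of update (8) follows; taking H̃_i^{(v)} to be the reconstruction Z_{i+1}^{(v)} ··· Z_m^{(v)} H_m^{(v)} of the i-th layer's factor, as in deep Semi-NMF [38], is an assumption about this implementation detail:

```python
import numpy as np
from functools import reduce

def update_Z(X_v, Z_above, Z_below, H_m):
    """Update rule (8). Z_above = [Z_1, ..., Z_{i-1}], Z_below = [Z_{i+1}, ..., Z_m]."""
    # Psi = Z_1 ... Z_{i-1}; an empty product is the identity
    Psi = reduce(np.matmul, Z_above) if Z_above else np.eye(X_v.shape[0])
    # H_tilde ~ Z_{i+1} ... Z_m H_m (reconstruction of the i-th layer's factor)
    H_tilde = reduce(np.matmul, Z_below) @ H_m if Z_below else H_m
    return np.linalg.pinv(Psi) @ X_v @ np.linalg.pinv(H_tilde)
```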

(v)

AC

4.3. Update Rule for H_i^{(v)}

We adopt a method similar to that in [7] to obtain the update rule for H_i^{(v)}. Keeping the related parts of O_u, we have

J = \|X^{(v)} - \Psi H_i^{(v)}\|_F^2 + \lambda_1 \|H_i^{(v)} - H_i^{(v)} S_i^{(v)}\|_F^2. \qquad (9)

The partial derivative with respect to H_i^{(v)} is given as follows,

\frac{\partial J}{\partial H_i^{(v)}} = -2\Psi^T X^{(v)} + 2\Psi^T \Psi H_i^{(v)} + 2\lambda_1 \left( H_i^{(v)} - 2 H_i^{(v)} S_i^{(v)} + H_i^{(v)} S_i^{(v)} S_i^{(v)T} \right). \qquad (10)

From the above formulations, we can obtain the update rule for H_i^{(v)} as follows¹,

H_i^{(v)} \leftarrow H_i^{(v)} \odot \sqrt{\frac{\Phi_n}{\Phi_d}}, \qquad (11)

where the square root and the division are taken element-wise and

\Phi_n = [\Psi^T X^{(v)}]^{pos} + [\Psi^T \Psi H_i^{(v)}]^{neg} + 2\lambda_1 [H_i^{(v)} S_i^{(v)}]^{pos} + \lambda_1 [H_i^{(v)}]^{neg} + \lambda_1 [H_i^{(v)} S_i^{(v)} S_i^{(v)T}]^{neg},

\Phi_d = [\Psi^T X^{(v)}]^{neg} + [\Psi^T \Psi H_i^{(v)}]^{pos} + 2\lambda_1 [H_i^{(v)} S_i^{(v)}]^{neg} + \lambda_1 [H_i^{(v)}]^{pos} + \lambda_1 [H_i^{(v)} S_i^{(v)} S_i^{(v)T}]^{pos}.

[·]^{pos} and [·]^{neg} denote the operations that replace all the negative and all the positive elements of a matrix by 0, respectively. Specifically, we have

[A]^{pos}_{jk} = \frac{|A_{jk}| + A_{jk}}{2}, \qquad [A]^{neg}_{jk} = \frac{|A_{jk}| - A_{jk}}{2}.
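A direct NumPy transcription of update (11) might look as follows (all operations on the bracketed terms are element-wise; the small eps guarding the division is an implementation detail, not part of the paper):

```python
import numpy as np

def update_H(X_v, Psi, H, S, lam1, eps=1e-9):
    """Multiplicative update (11) for one H_i^{(v)}."""
    pos = lambda A: (np.abs(A) + A) / 2
    neg = lambda A: (np.abs(A) - A) / 2
    PtX, PtP = Psi.T @ X_v, Psi.T @ Psi
    HS, HSSt = H @ S, H @ S @ S.T
    num = pos(PtX) + neg(PtP @ H) + 2 * lam1 * pos(HS) + lam1 * neg(H) + lam1 * neg(HSSt)
    den = neg(PtX) + pos(PtP @ H) + 2 * lam1 * neg(HS) + lam1 * pos(H) + lam1 * pos(HSSt)
    return H * np.sqrt(num / (den + eps))
```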

4.4. Update Rule for S_i^{(v)}

Keeping the parts that are related to S_i^{(v)} from the overall objective in (7), we obtain the following problem to solve

\min_{S_i^{(v)}} \; \lambda_1 \|H_i^{(v)} - H_i^{(v)} S_i^{(v)}\|_F^2 + \lambda_1 \|S_i^{(v)}\|_* + \lambda_2 \alpha_i^{(v)} \|F - F S_i^{(v)}\|_F^2, \quad \text{s.t. } S_i^{(v)} \ge 0. \qquad (12)

¹ The theoretical analysis of update rule (11) is presented in the appendix to illustrate its effectiveness.

This problem can be solved by the alternating direction method of multipliers (ADMM) [22]. Similar to [23], (12) can be rewritten in an unconstrained form:

\min_{S_i^{(v)}} \; \lambda_1 \|H_i^{(v)} - H_i^{(v)} S_i^{(v)}\|_F^2 + \lambda_1 \|S_i^{(v)}\|_* + \lambda_2 \alpha_i^{(v)} \|F - F S_i^{(v)}\|_F^2 + l_{R_+}(S_i^{(v)}), \qquad (13)

where l_{R_+}(x) is the indicator function, defined as

l_{R_+}(x) = \begin{cases} 0 & \text{if } x \ge 0, \\ +\infty & \text{otherwise.} \end{cases}

Then we introduce auxiliary variables, and problem (13) is equivalent to the following problem

\min_{S_i^{(v)}} \; \lambda_1 \|H_i^{(v)} - Y_1\|_F^2 + \lambda_1 \|Y_2\|_* + \lambda_2 \alpha_i^{(v)} \|F - Y_3\|_F^2 + l_{R_+}(Y_4),
\quad \text{s.t. } H_i^{(v)} S_i^{(v)} = Y_1, \; S_i^{(v)} = Y_2, \; F S_i^{(v)} = Y_3, \; S_i^{(v)} = Y_4. \qquad (14)

The augmented Lagrangian function of problem (14) is

L(S_i^{(v)}, Y_1, Y_2, Y_3, Y_4) = \lambda_1 \|H_i^{(v)} - Y_1\|_F^2 + \lambda_1 \|Y_2\|_* + \lambda_2 \alpha_i^{(v)} \|F - Y_3\|_F^2 + l_{R_+}(Y_4)
+ \frac{\mu}{2} \|H_i^{(v)} S_i^{(v)} - Y_1 - D_1\|_F^2 + \frac{\mu}{2} \|S_i^{(v)} - Y_2 - D_2\|_F^2
+ \frac{\mu}{2} \|F S_i^{(v)} - Y_3 - D_3\|_F^2 + \frac{\mu}{2} \|S_i^{(v)} - Y_4 - D_4\|_F^2. \qquad (15)

To solve Si from (15), we set the partial derivative ∂(L(Si , Y1 , Y2 , Y3 , Y4 ))/ (v)

∂(Si ) = 0 and obtain (v)

Si

(v)T

← B −1 (Hi

ξ1 + ξ2 + F T ξ3 + ξ4 ), 16

(16)

ACCEPTED MANUSCRIPT

(v)T

where B = Hi

(v)

+ F T F + 2I, ξj = Yj + Dj , and I is the identity matrix.

Hi

(v)

To solve Y1 , we set ∂(L(Si , Y1 , Y2 , Y3 , Y4 ))/∂(Y1 ) = 0 and obtain 1 (v) (v) (v) [2λ1 Hi + µ(Hi Si − D1 )]. 2λ1 + µ

To obtain Y2 , we solve the following problem min λ1 ||Y2 ||∗ + Y2

(17)

CR IP T

Y1 ←

µ (v) ||Si − Y2 − D2 ||2F . 2

(18)

Problem (18) can be solved by Singular Value Thresholding operator [3], i.e.,

AN US

Θτ (X) = U Λτ V T , where X = U Λτ V T is the singular value decomposition, and Λτ (x) = sgn(x)max(|x| − τ, 0) is the shrinkage operator. The update rule for Y2

is

(v)

Y2 ← Θλ1 /µ (Si (v)

− D2 ).

(19)

of Y3 can be obtained as 1

(v) 4λ2 αi



(v)

(v)

[4λ2 αi F + µ(F Si

ED

Y3 ←

M

If we set partial derivative ∂(L(Si , Y1 , Y2 , Y3 , Y4 ))/ ∂(Y3 ) = 0, the update rule

− D3 )].

(20)

Since the non-negative constraint is imposed on Y4 , it can be solved by the fol-

PT

lowing update rule

(v)

Y4 ← max(Si

− D4 , 0).

(21)

CE

At last, the Lagrangian multipliers D1 , D2 , D3 and D4 are updated based on the ADMM method. All the variables are solved by the above update rules, and the (v)

is summarized in Algorithm 1.

AC

whole procedure for solving Si 4.5. Update Rule for F

Fixing the related parts of S from Ou , we can obtain the following problem min F

V X m X v=1 i=1

(v)

(v)

αi ||F − F Si ||2F , 17

s.t. F F T = I,

(22)

ACCEPTED MANUSCRIPT

(v)

Algorithm 1: The ADMM algorithm to solve Si (v)

(v)

2

Initialization: ∀j, Yj = Dj = 0 while not converged do (v)

(v)T

← B −1 (Hi

3

Si

4

Y1 ←

5

Y2 ←

6

Y3 ←

7

Y4 ← max(Si

ξ1 + ξ2 + F T ξ3 + ξ4 )

(v) (v) (v) 1 [2λ1 Hi + µ(Hi Si 2λ1 +µ (v) Θλ1 /µ (Si − D2 ) (v) 1 [4λ2 αi F (v) 4λ2 αi +µ (v)

(v)

+ µ(F Si

− D4 , 0)

8

update the Lagrange multipliers:

9

D1 ← D1 − (Hi Si

(v)

11

D3 ← D3 −

12

D4 ← D4 − (Si end (v)

− Z4 )

PT

Output: Si .

− Z3 )

ED

13

(v)

− Z1 )

− Z2 )

(v) (F Si

− D3 )]

M

(v)

D2 ← D2 − (Si

10

(v)

− D1 )]

AN US

1

CR IP T

Input: Hi , F , αi , λ1 , λ2 , µ.

then the following equivalent problem is derived (the detailed derivations can be

AC

CE

found in the appendix),

where L =

V P m P

v=1 i=1

min Tr(F LF T ),

s.t. F F T = I,

F

(v)

(v)

αi (I − 2Si

(v)

(v) T

+ Si Si

(23)

). This is a standard spectral clus-

tering objective, and the solution of F is given by the eigenvectors of L corresponding to the k smallest eigenvalues.

18

ACCEPTED MANUSCRIPT

4.6. Update Rule for α The relevant terms in the overall objective in (7) with respect to α are

(v) αi

V X m X v=1 i=1

(v) (v)

αi di + λ3 ||α||G ,

s.t. α ≥ 0, 1T α = 1, (v)

where di

(v)

CR IP T

min λ2

(24)

= ||F − F Si ||2F . Problem (24) is solved by the coordinate descent (p)

(q)

method. In each iteration, only two variables αj , αk

∈ α are selected to be

AN US

updated while the others are fixed. The coordinate descent method for solving problem (24) is shown in the appendix. Thus, we have  √ (q) (p) (p) (q)  α(p)∗ ← λ(αj +αk )+ ξp (dk −dj ) , j 2λ  α(q)∗ ← α(p) + α(q) − α(p)∗ , j

(p)

j

k

M

k

(q)

λ(

ξp +

ξq )

  α(q)∗ ← α(p) + α(q) − α(p)∗ , j j k k

PT

Pm

ED

if αj and αk are in the same group; Otherwise,  √ (p) (q) √ √ (q) (p) ξp ξq (dk −dj )   α(p)∗ ← λ ξp (αj +αk√)+ √ , j where ξp =

(25)

i=1

(p)

(αi )2 , ξq =

Pm

i=1

(q)

(αi )2 , λ = λ3 /λ2 . To guarantee the non-

(p)∗

(p)∗

= max(αj

(q)∗

, 0), and vice versa for αk .

CE

negative constraint α > 0, we set αj

(26)

4.7. Time Complexity

AC

The proposed algorithm includes two parts: pre-training and fine-tuning, and

we analysis the complexity of each part. The computational complexity for pretraining is of order O(mV t1 (ndk+nk 2 +kn2 )), where n is the number of samples, m is the number of layers, t1 is the number of iteration until convergence, k is the

maximum dimension of all the layers, and d is the maximum dimension of the 19

ACCEPTED MANUSCRIPT

Algorithm 2: The learning procedure of DLRSE Input: Multi-view matrices {X (i) }Vi=1 and parameters: m, λ1 , λ2 , λ3 . (v)

(v)

Initialize Zi , Hi

by pre-training, αi

2

while not converged do

=

for v = 1, ..., V and i = 1, ...m do

3

e †; ← Ψ† X (v) H i q (v) (v) (v) n ; Update Hi by Hi ← Hi ∇ ∇d by Zi

5

(v)

by Algorithm 1;

Update Si

6

(v)

(v)

.

AN US

(v)

Update Zi

4

7

end

8

Update F by solving problem (23);

9

Update α by update rule (25) and (26);

10

1 m×V

CR IP T

(v)

1

end

M

Output: The consensus subspace F .

ED

original feature of each view. In the fine-tuning stage, the main computational (v)

(v)

(v)

cost is dominated by the update of Si , Zi , Hi

(v)

and F . Solving Si

includes

PT

matrix inversion and singular value decomposition, so the complexity for fine tuning is of order O mV t2 (ndk + nk 2 + kn2 + t3 (n3 + kn2 + cn2 )) , where t2

CE

and t3 are the number of iterations of Algorithm 2 and Algorithm 1, respectively.

AC

5. Experiments Face data contains good hierarchical structure and multiple attributes, which

is suitable to illustrate the properties of the proposed multi-view multi-layer ensemble learning method. Thus, we conduct experiments on four face image/video datasets to evaluate the clustering performance. We first introduce the datasets, 20

ACCEPTED MANUSCRIPT

the compared methods and parameter settings. Then, the clustering performance and the corresponding parameter sensitivity are analyzed in detail.

CR IP T

5.1. Datasets

Four face datasets are adopted in our experiments. The Yale dataset [2] con-

tains 165 images of 15 individuals, with faces of different expressions and features

such as with/without glasses, left-light, center-light, right-light, happy, sad, sleepy and surprised. The Extended YaleB dataset [10] contains 2, 414 face images from

AN US

38 individuals. Each individual has 64 images under different illuminations. As in [5], we select first 10 subjects with 640 images for our experiment. The ORL dataset [32] contains totally 400 images belong to 40 distinct individuals. There are 10 different images for each subject, and some of the images are taken at dif-

M

ferent lighting, times, expressions and facial details. The Notting-Hill dataset [47] is a video face clustering dataset from clips of the movie “Notting-Hill”. It

ED

consists of 4, 660 faces of 5 main casts in 76 tracks. For fair comparison, we adopt the same data preprocessing method as that in [5]. The images are first resized into 48×48, and three visual features are extracted

PT

as different views: intensity, LBP [31] and Gabor [20]. LBP is a 59 dimension histogram which is extracted from 9 × 10 pixel patches over cropped images. Gabor

CE

wavelets are obtained with scale λ = 4 at orientations θ = {0◦ , 45◦ , 90◦ , 135◦ },

AC

resulting in feature vectors with 6, 750 dimensions. 5.2. Compared Methods and Evaluation Metrics

We compare DLRES with the following baseline and state-of-the-art multi-

view clustering methods.

21

ACCEPTED MANUSCRIPT

BestSV (baseline). The method conducts spectral clustering on each view and the performance of the best single view is reported.

CR IP T

ConcatFea (baseline). The method concatenates the features of all views into a joint feature vector, and then performs spectral clustering on it.

ConcatPCA (baseline). This method concatenates all the views and perform PCA to obtain a new representation. Clustering is performed by spectral clustering.

AN US

Co-Train [18]. This is a co-training based multi-view spectral clustering method which assumes that different views would assign a sample to the same cluster.

GDBNSC [36]. This is a single-view spectral clustering method which integrates geometrical structure and discriminative structure of data.

NSSRD [34]. The method is a single-view feature selection method that simul-

M

taneously exploits the geometric information of feature space and data space. SGFS [33]. This is a single-view feature selection method which preserves geo-

ED

metric structure information on the feature manifold. DFSC [35]. The method adopts self-representation property and local informa-

PT

tion of both data space and feature space to select representative features. MultiNMF [24]. The method learns a commonly shared non-negative latent

CE

space from multiple views by NMF, and then the clustering results are obtained from this commonly shared latent space.

AC

NaMSC. The method first learns subspace representations from each view using SMR [15], and then applies spectral clustering on the combination of the learned representations. DiMSC [5]. The method learns diversified subspaces from different views by using Hilbert-Schmidt Independence Criterion, and then integrates the subspaces 22

ACCEPTED MANUSCRIPT

for clustering. MVSC [9]. The method conducts clustering on subspace of each view simulta-

CR IP T

neously, then a consistent cluster structure is learned from multiple subspaces. DMVMF [48]. The method extends deep Semi-NMF to multi-view clustering task. Multi-view information are shared on the top layer of the deep model.

ECMSC [43]. The method uses exclusivity and consistency constraints for subspace learning to obtain consistent clustering from multiple views.

AN US

SWMC [30]. This is a totally self-weighted multi-view clustering method, and

the cluster label can be directly assigned to each data without any post-processing. This paper focuses on multi-view methods (i.e., more than two data views). Therefore, the two-view methods such as those based on deep CCA [1, 42] are

M

not considered in our experiment. Since the single-view clustering and feature selection methods (e.g., GDBNSC, NSSRD, SGFS and DFSC) cannot be directly

ED

applied in multi-view clustering, we concatenate multi-view feature into a joint feature vector as the new data representation. Specifically, GDBNSC is performed using the new representation. NSSRD, SGFS and DFSC first select effective fea-

PT

tures from the new representation and then perform clustering on the selected features. All the parameters of the compared methods are set as suggested by the

CE

authors.

Similar to [5, 43], we adopt 6 evaluation metrics including normalized mutual

AC

information (NMI), accuracy (ACC), adjusted rand index (AR), F-score, Precision

and Recall. Higher values of these metrics indicate better performance. Each metric indicates a specific property of the clustering, so adopting multiple metrics can provide a comprehensive evaluation of clustering performance.

23

5.2

15.6

5.1

15.4

5 4.9 4.8 4.7

0

20

40

60

80

15.2 15 14.8 14.6

100

Iteration

0

20

40

60

80

100

Iteration

(a) Yale

(b) Extended YaleB

110

70

90

80

70

66

AN US

Objective Value

100

Objective Value

CR IP T

Objective Value

Object Value

ACCEPTED MANUSCRIPT

62

58

54

60

0

20

40

60

Iteration

100

0

20

40

60

80

100

Iteration

(d) Notting-Hill

M

(c) ORL

80

Figure 2: The convergence curves on four datasets.

ED

5.3. Parameter Setting and Convergence Analysis There are several hyper-parameters need to be determined in DLRSE, includ-

PT

ing the layer structure, and parameters λ1 , λ2 , λ3 in the objective function (7). In practice, we find λ1 = λ2 can obtain good performance so that the parameters

CE

of DLRSE can be easily tuned. In the experiments, λ1 and λ2 are chosen from {10−7 , 10−6 ,..., 10−1 }. For λ3 , we enumerate µ = λ3 /λ2 in the range of {101 ,

AC

102 ,..., 107 }. Moreover, the layer structure is selected from {(50), (100-50), (150-

100-50)}, where (150-100-50) represents a 3-layer model and the dimensions of the three layers are 150, 100 and 50, respectively. The parameter c (the dimension of F ) is chosen from {5, 10, 20, 30,..., 70}. The detailed sensitivity analysis of some important parameters are presented in Section 5.5. All the experiments are 24

ACCEPTED MANUSCRIPT

Table 1: Clustering Performance (mean ± standard deviation) on Yale. Bold font shows best

performance.

ACC 0.616 ± 0.030 0.544 ± 0.038 0.578 ± 0.038 0.630 ± 0.001 0.635 ± 0.006 0.658 ± 0.003 0.655 ± 0.006 0.663 ± 0.008 0.673 ± 0.001 0.636 ± 0.000 0.709 ± 0.003 0.694 ± 0.017 0.745 ± 0.011 0.771 ± 0.014 0.719 ± 0.017 0.798 ± 0.009

AR 0.440 ± 0.011 0.392 ± 0.009 0.396 ± 0.011 0.452 ± 0.010 0.458 ± 0.004 0.512 ± 0.005 0.509 ± 0.003 0.513 ± 0.008 0.495 ± 0.001 0.475 ± 0.004 0.535 ± 0.001 0.527 ± 0.020 0.579 ± 0.002 0.590 ± 0.014 0.571 ± 0.015 0.637 ± 0.010

F-score 0.475 ± 0.011 0.431 ± 0.008 0.434 ± 0.011 0.487 ± 0.009 0.493 ± 0.005 0.527 ± 0.004 0.539 ± 0.007 0.546 ± 0.010 0.527 ± 0.000 0.508 ± 0.007 0.564 ± 0.002 0.560 ± 0.018 0.601 ± 0.002 0.617 ± 0.012 0.585 ± 0.010 0.663 ± 0.009

Precision 0.457 ± 0.011 0.415 ± 0.007 0.419 ± 0.012 0.470 ± 0.010 0.477 ± 0.006 0.501 ± 0.003 0.519 ± 0.008 0.521 ± 0.011 0.512 ± 0.000 0.492 ± 0.003 0.543 ± 0.001 0.490 ± 0.024 0.598 ± 0.001 0.584 ± 0.013 0.564 ± 0.012 0.640 ± 0.011

Recall 0.495 ± 0.010 0.448 ± 0.008 0.450 ± 0.009 0.505 ± 0.007 0.509 ± 0.004 0.556 ± 0.004 0.560 ± 0.006 0.573 ± 0.009 0.543 ± 0.000 0.524 ± 0.004 0.586 ± 0.003 0.657 ± 0.013 0.613 ± 0.002 0.653 ± 0.013 0.607 ± 0.011 0.688 ± 0.008

CR IP T

NMI 0.654 ± 0.009 0.641 ± 0.006 0.665 ± 0.037 0.672 ± 0.006 0.687 ± 0.003 0.709 ± 0.004 0.712 ± 0.005 0.713 ± 0.009 0.690 ± 0.001 0.671 ± 0.011 0.727 ± 0.010 0.755 ± 0.011 0.782 ± 0.010 0.773 ± 0.010 0.754 ± 0.021 0.813 ± 0.006

AN US

Method BestSV ConcatFea ConcatPCA Co-Train [18] GDBNSC [36] NSSRD [34] SGFS [33] DFSC [35] MultiNMF [24] NaMSC [15] DiMSC [5] MVSC [9] DMVMF [48] ECMSC [43] SWMC [30] DLRSE

implemented on a PC with a 3.6 GHz i7 processor and 16 GB memory. Our objective function is solved by the iterative block coordinate descent method (Algorithm 2). As the convergence for Algorithm 2 is not straightfor-

M

ward to analyze in theory, the experiments demonstrate that our algorithm converges quickly and stably in practice. The convergence curves on all the datasets

ED

are shown in Figure 2. We observe that the proposed optimization method usually converges in several iterations rapidly. Specifically, DLRSE achieves convergence

PT

within 30 iterations on the Yale, Extended YaleB and ORL datasets, and converges within 70 iterations on the larger Notting-Hill dataset.

CE

5.4. Performance Comparison Multi-view clustering results on each dataset are shown in Table 1, Table 2,

AC

Table 3, and Table 4. For all the experiments, we repeat 10 times and report the mean value and standard deviation. Compared to the baseline single view clustering method BestSV, multi-view clustering based methods (i.e., Co-Train, MultiNMF and NaMSC) achieve better performance. In terms of NMI, the largest performance improvements on four datasets are: 15.9%, 43.1%, 6.8%, 10.6%, 25

ACCEPTED MANUSCRIPT

Table 2: Clustering Performance (mean ± standard deviation) on Extended YaleB. Bold font

shows best performance.

ACC 0.366 ± 0.059 0.224 ± 0.012 0.232 ± 0.005 0.186 ± 0.001 0.432 ± 0.009 0.442 ± 0.006 0.449 ± 0.011 0.435 ± 0.007 0.428 ± 0.002 0.581 ± 0.013 0.615 ± 0.003 0.619 ± 0.011 0.763 ± 0.001 0.783 ± 0.011 0.632 ± 0.015 0.788 ± 0.002

AR 0.225 ± 0.018 0.064 ± 0.003 0.069 ± 0.002 0.043 ± 0.001 0.237 ± 0.006 0.252 ± 0.008 0.268 ± 0.009 0.241 ± 0.008 0.231 ± 0.001 0.380 ± 0.002 0.453 ± 0.000 0.477 ± 0.008 0.512 ± 0.002 0.544 ± 0.008 0.468 ± 0.011 0.593 ± 0.005

F-score 0.303 ± 0.011 0.159 ± 0.002 0.161 ± 0.002 0.140 ± 0.001 0.334 ± 0.008 0.358 ± 0.007 0.367 ± 0.010 0.346 ± 0.006 0.329 ± 0.001 0.446 ± 0.004 0.504 ± 0.006 0.545 ± 0.007 0.564 ± 0.001 0.597 ± 0.010 0.547 ± 0.017 0.636 ± 0.004

Precision 0.296 ± 0.010 0.155 ± 0.002 0.158 ± 0.001 0.137 ± 0.001 0.305 ± 0.008 0.330 ± 0.006 0.337 ± 0.009 0.316 ± 0.005 0.298 ± 0.001 0.411 ± 0.002 0.481 ± 0.002 0.506 ± 0.007 0.525 ± 0.001 0.513 ± 0.009 0.501 ± 0.019 0.569 ± 0.006

Recall 0.310 ± 0.012 0.162 ± 0.002 0.164 ± 0.002 0.143 ± 0.002 0.368 ± 0.009 0.392 ± 0.009 0.403 ± 0.010 0.382 ± 0.007 0.372 ± 0.002 0.486 ± 0.001 0.534 ± 0.001 0.591 ± 0.008 0.610 ± 0.001 0.718 ± 0.006 0.603 ± 0.016 0.721 ± 0.005

CR IP T

NMI 0.360 ± 0.016 0.147 ± 0.005 0.152 ± 0.003 0.302 ± 0.007 0.375 ± 0.006 0.386 ± 0.005 0.393 ± 0.010 0.382 ± 0.008 0.377 ± 0.006 0.594 ± 0.004 0.635 ± 0.002 0.626 ± 0.008 0.649 ± 0.002 0.769 ± 0.012 0.640 ± 0.013 0.791 ± 0.003

AN US

Method BestSV ConcatFea ConcatPCA Co-Train [18] GDBNSC [36] NSSRD [34] SGFS [33] DFSC [35] MultiNMF [24] NaMSC [15] DiMSC [5] MVSC [9] DMVMF [48] ECMSC [43] SWMC [30] DLRSE

Table 3: Clustering Performance (mean ± standard deviation) on ORL. Bold font shows best

performance.

AR 0.655 ± 0.005 0.553 ± 0.007 0.564 ± 0.010 0.656 ± 0.007 0.666 ± 0.010 0.686 ± 0.005 0.745 ± 0.011 0.714 ± 0.006 0.751 ± 0.027 0.769 ± 0.020 0.802 ± 0.000 0.783 ± 0.023 0.815 ± 0.013 0.810 ± 0.012 0.771 ± 0.006 0.831 ± 0.011

M

ACC 0.726 ± 0.025 0.648 ± 0.033 0.675 ± 0.028 0.730 ± 0.005 0.733 ± 0.009 0.759 ± 0.006 0.802 ± 0.010 0.784 ± 0.007 0.828 ± 0.017 0.813 ± 0.003 0.838 ± 0.001 0.816 ± 0.011 0.860 ± 0.012 0.854 ± 0.011 0.815 ± 0.010 0.885 ± 0.013

ED

NMI 0.884 ± 0.002 0.831 ± 0.003 0.835 ± 0.004 0.901 ± 0.003 0.896 ± 0.011 0.902 ± 0.005 0.908 ± 0.012 0.905 ± 0.004 0.906 ± 0.011 0.926 ± 0.006 0.940 ± 0.003 0.932 ± 0.005 0.941 ± 0.012 0.947 ± 0.009 0.931 ± 0.008 0.952 ± 0.004

PT

Method BestSV ConcatFea ConcatPCA Co-Train [18] GDBNSC [36] NSSRD [34] SGFS [33] DFSC [35] MultiNMF [24] NaMSC [15] DiMSC [5] MVSC [9] DMVMF [48] ECMSC [43] SWMC [30] DLRSE

F-score 0.664 ± 0.005 0.564 ± 0.007 0.574 ± 0.010 0.665 ± 0.007 0.670 ± 0.009 0.705 ± 0.007 0.733 ± 0.012 0.715 ± 0.005 0.757 ± 0.027 0.774 ± 0.004 0.807 ± 0.003 0.787 ± 0.022 0.823 ± 0.013 0.821 ± 0.015 0.777 ± 0.013 0.851 ± 0.011

Precision 0.610 ± 0.006 0.522 ± 0.007 0.532 ± 0.011 0.612 ± 0.008 0.623 ± 0.008 0.669 ± 0.007 0.705 ± 0.011 0.673 ± 0.004 0.739 ± 0.028 0.731 ± 0.001 0.764 ± 0.012 0.729 ± 0.025 0.799 ± 0.019 0.783 ± 0.008 0.720 ± 0.015 0.829 ± 0.014

Recall 0.728 ± 0.005 0.614 ± 0.008 0.624 ± 0.008 0.727 ± 0.006 0.726 ± 0.011 0.746 ± 0.006 0.764 ± 0.013 0.763 ± 0.005 0.776 ± 0.026 0.823 ± 0.002 0.856 ± 0.004 0.855 ± 0.009 0.849 ± 0.008 0.859 ± 0.012 0.843 ± 0.013 0.875 ± 0.008

respectively. This indicates that multi-view methods can exploit multi-view com-

CE

plementary information and achieves better results than using only single view. In addition, it is difficult for BestSV to select which view is the best one for cluster-

AC

ing. Multi-view methods can automatically evaluate the importance of each view to obtain better clustering results. For multi-view clustering based methods, DLRSE outperforms other methods

on all datasets by significant margins. Specifically, for the Yale dataset, DLRSE

26

ACCEPTED MANUSCRIPT

Table 4: Clustering Performance (mean ± standard deviation) on Notting-Hill. Bold font shows

best performance.

ACC 0.813 ± 0.000 0.673 ± 0.033 0.733 ± 0.008 0.689 ± 0.027 0.826 ± 0.009 0.818 ± 0.008 0.881 ± 0.009 0.810 ± 0.006 0.831 ± 0.001 0.752 ± 0.013 0.843 ± 0.021 0.818 ± 0.018 0.871 ± 0.009 0.891 ± 0.011 0.830 ± 0.013 0.914 ± 0.026

AR 0.712 ± 0.020 0.612 ± 0.041 0.598 ± 0.015 0.589 ± 0.035 0.754 ± 0.007 0.729 ± 0.007 0.779 ± 0.011 0.715 ± 0.008 0.762 ± 0.000 0.666 ± 0.004 0.787 ± 0.001 0.772 ± 0.017 0.803 ± 0.002 0.788 ± 0.011 0.765 ± 0.015 0.825 ± 0.023

F-score 0.775 ± 0.015 0.696 ± 0.032 0.685 ± 0.012 0.677 ± 0.026 0.791 ± 0.008 0.780 ± 0.008 0.829 ± 0.011 0.776 ± 0.009 0.815 ± 0.000 0.738 ± 0.005 0.834 ± 0.001 0.824 ± 0.020 0.847 ± 0.002 0.834 ± 0.012 0.818 ± 0.013 0.881 ± 0.019

Precision 0.774 ± 0.018 0.699 ± 0.032 0.691 ± 0.010 0.688 ± 0.030 0.782 ± 0.007 0.776 ± 0.008 0.806 ± 0.012 0.770 ± 0.009 0.804 ± 0.001 0.746 ± 0.002 0.822 ± 0.005 0.817 ± 0.018 0.826 ± 0.007 0.845 ± 0.009 0.809 ± 0.015 0.871 ± 0.022

Recall 0.776 ± 0.013 0.693 ± 0.031 0.680 ± 0.014 0.667 ± 0.023 0.801 ± 0.008 0.784 ± 0.009 0.854 ± 0.011 0.782 ± 0.009 0.824 ± 0.001 0.730 ± 0.011 0.836 ± 0.009 0.831 ± 0.021 0.870 ± 0.001 0.822 ± 0.010 0.827 ± 0.017 0.892 ± 0.020

CR IP T

NMI 0.723 ± 0.008 0.628 ± 0.028 0.632 ± 0.009 0.766 ± 0.005 0.750 ± 0.008 0.727 ± 0.006 0.791 ± 0.010 0.714 ± 0.007 0.752 ± 0.001 0.730 ± 0.002 0.799 ± 0.001 0.785 ± 0.015 0.797 ± 0.005 0.793 ± 0.010 0.775 ± 0.018 0.829 ± 0.020

AN US

Method BestSV ConcatFea ConcatPCA Co-Train [18] GDBNSC [36] NSSRD [34] SGFS [33] DFSC [35] MultiNMF [24] NaMSC [15] DiMSC [5] MVSC [9] DMVMF [48] ECMSC [43] SWMC [30] DLRSE

improves the performance by 3.1% in NMI, 2.7% in ACC, 4.7% in AR, 4.6% in F-score. For the Extended YaleB dataset, DLRSE improves the performance by 2.2% in NMI, 0.5% in ACC, 4.9% in AR, 3.9% in F-score. For the ORL dataset,

M

DLRSE raises performance by 0.5% in NMI, 2.5% in ACC, 1.6% in AR, 2.8% in F-score. For the video face dataset Notting-Hill, the clustering task becomes more

ED

challenging because the face images always vary significantly due to the different lighting conditions. From Table 4, we can observe that DLRSE outperforms other

PT

methods by 2.3% in NMI, 1.9% in ACC, 2.2% in AR, 3.4% in F-score. We would also like to highlight several other aspects of the experimental re-

CE

sults.

• Two naive multi-view feature fusion methods ConcatFea and ConcatPCA

AC

fail to achieve good clustering performance, and perform even worse than BestSV. Since different views have different physical meanings, it is difficult to leverage complementary information effectively by simply concatenating features.

27

ACCEPTED MANUSCRIPT

• The single-view clustering and feature selection methods (i.e., GDBNSC, NSSRD, SGFS and DFSC) achieve better clustering results than BestSV,

CR IP T

ConcatFea, ConcatPCA and Co-train. GDBNSC leverages the global geometrical structure and global discriminative structure to improve the clustering quality. NSSRD, SGFS and DFSC can select effective and discrimina-

tive features to obtain better clustering results. However, they consider little about the multi-view complementary information, resulting inferior per-

AN US

formance compared with multi-view clustering methods, e.g., MultiNMF, NaMSC, DLRSE.

• Deep matrix factorization based methods (e.g., DMVMF and DLRSE) perform better than classical matrix factorization method MultiNMF. This is due to deep matrix factorization can learn complex hierarchical structure

M

and capture diversified attributes of data, which the classical one-level ma-

ED

trix factorization cannot reveal.

• DLRSE achieves better performance than the other three subspace cluster-

PT

ing methods DiMSC, MVSC and ECMSC. ECMSC generally performs better than DiMSC and MVSC by considering exclusivity and consistency simultaneously. DLRSE further improves performance by introducing multi-

CE

view multi-layer ensemble learning, which can obtain a more accurate clus-

AC

tering structure by aggregating multiple hidden clustering structures.

• DLRSE performs better than another deep matrix factorization based method DMVMF, which indicates that intermediate layers can provide hierarchical clustering structure and take advantage of intermediate layers in the deep model is benefit for exploring the intrinsic structure of data. 28

ACCEPTED MANUSCRIPT

• SWMC learns a similarity graph from multiple views without using a hy-

perparameter. Although convenient, it fails to achieve better performance

CR IP T

than DLRSE. The main reason is that the clustering results of SWMC rely on the quality of the similarity graph of each view. DLRSE can extract

more abstract representation and diversified clustering structures through deep structure, so that more accurate clustering structures can be obtained. 5.5. Sensitivity Analysis

AN US

We study how the performance varies with different parameter settings. All the experiments are repeated 10 times, and the averaged performance is reported.

Influence of the size of layers. To investigate how the clustering performance of DLRSE vary with the size of layers, we construct 1-layer model, 2-layer model

M

and 3-layer model, and make the size of each layer increase by 20, 50 and 100, respectively. Thus, 9 models are obtained, i.e., step size with 20: (20), (40-20),

ED

(60-40-20); step size with 50: (50), (100-50), (150-100-50); step size with 100: (100), (200-100), (300-200-100). The clustering results of each model on Extended YaleB dataset are shown in Figure 3. We can see that the step size with 50

PT

achieves better performance than the other two step sizes. Step size with 20 fails to perform good because the dimensions of layers are so small that they cannot

CE

fully capture the information of multi-view data. Step size with 100 obtains comparable results as step size with 50, however, the model contains a lot of redundant

AC

information that is not helpful for the clustering task. Hence, the step size of each model is set to 50 in our experiments.

Influence of the number of layers. The clustering results of DLRSE with different number of layers (i.e., m = {1, 2, 3, 4}) are shown in Figure 4. For all the datasets, we set layers as (50), (100-50), (150-100-50) and (200-150-100-50). 29

ACCEPTED MANUSCRIPT

0.9 1-layer model 2-layer model 3-layer model

0.85 0.8

CR IP T

0.75

NMI

0.7 0.65 0.6 0.55 0.5

0.4

Step size: 20

AN US

0.45 Step size: 50

Step size: 100

Figure 3: The clustering performance with different size of layers.

M

It can be observed that the performance for 1-layer model or 2-layer model are higher than 1-layer model in all the datasets. This verifies that the deep model can

ED

reveal complex and unknown attributes contained in data and achieve better clustering results. When m = 4, the performances are slightly degraded, especially on

PT

the Yale dataset. This is because 4-layer model has more parameters that it cannot be properly trained with limited training data. In the following experiments, we

CE

adopt 2-layer structure (100-50) for the Yale, Extended YaleB and Notting-Hill datasets, and adopt 3-layer structure (150-100-50) for the ORL dataset. Influence of intermediate variables. To analyze the benefit of learning the inter(v)

(v)

AC

mediate variables {Si }Vv=1 of DLRSE, we revise Ou by removing {S1 , · · · , (v)

(v)

(v)

(v)

Si−1 , Si+1 , · · · , Sm }Vv=1 and only learning {Si }Vv=1 and F . We adopt layer

structure (100-50) for Yale, Extended YaleB and Notting-Hill datasets, and (150100-50) for ORL dataset. The detailed results are shown in Figure 5. It can be 30

ACCEPTED MANUSCRIPT

1

0.9

1-layer model 2-layer model 3-layer model 4-layer model

CR IP T

0.95

NMI

0.85 0.8 0.75 0.7

0.6

Yale

AN US

0.65

Extended YaleB

ORL

Notting-Hill

Figure 4: The clustering performance with different number of layers.

M

seen that using all the layers outperforms using only one layer. Specifically, using all the layers improves 5.22% and 2.48% than using only layer 1 or layer 2 in

ED

terms of NMI on Yale dataset, improves 2.02% and 2.43% than using only layer 1 or layer 2 in terms of NMI on Extended YaleB dataset, improves 1.8%, 1.21%

PT

and 0.97% than using only layer 1, layer 2 or layer 3 in terms of NMI on ORL dataset, and improves 3.4% and 1.87% than using only layer 1 or layer 2 in terms

CE

of NMI on Notting-Hill dataset. As such, different layers contain complementary (v)

information, and jointly leveraging the intermediate variables {Si }Vv=1 can help

AC

improve the overall performance.

Influence of subspace and ensemble learning. To study the influence of modules in (7), we focus on two important parameters λ2 and λ3 . From the solution of

α in (25) and (26), the sparsity of α is affected by the value of λ3 /λ2 . So, we tune µ = λ3 /λ2 instead of λ3 in the experiment. The sensitivity analysis on Extended 31

ACCEPTED MANUSCRIPT

1.00 Layer 1 Layer 2 Layer 3 All

0.95 0.90

CR IP T

NMI

0.85 0.80 0.75 0.70 0.65

Yale

AN US

0.60

Extended YaleB

ORL

Notting-Hill

Figure 5: The clustering performance using each individual layer.

M

YaleB is shown in Figure 6. We can observe that the performance is relatively stable with λ2 . In other words, the promising performance can be obtained in a

ED

wide range for λ2 ∈ [10−6 , 5 × 10−3 ]. µ controls the sparsity of coefficients α.

α becomes very sparse when µ < 103 , and the performance are degraded since a lot of information are discounted. For µ = 0, the group sparsity regularization

PT

is disabled, which leads to a completely sparse solution and only one subspace is selected for clustering. The complementary information among different sub-

CE

spaces cannot be used so that the clustering performance is seriously influenced. When µ > 107 , α tends to be uniform. Each subspace is considered to be equally

AC

important and the performance are also limited. The appropriate range of µ is [103 , 107 ]. Influence of the ensemble representation. As shown in Figure 7, we present how the clustering performance varies with different dimensions c of F in (7).

32

AN US

CR IP T

ACCEPTED MANUSCRIPT

Figure 6: Parameter sensitivity analysis on Extended YaleB.

Influence of the ensemble representation. As shown in Figure 7, we present how the clustering performance varies with different dimensions c of F in (7). We enumerate c from {20, 30, ..., 70} for the Yale, Extended YaleB and ORL datasets, and from {5, 10, ..., 30} for the Notting-Hill dataset. Generally, the clustering performance is not sensitive to c. We set c = 30 for the Yale and ORL datasets, c = 60 for the Extended YaleB dataset, and c = 25 for the Notting-Hill dataset.
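For completeness, a minimal sketch of the final clustering step is given below, assuming `F` holds a learned consensus representation with c rows and one column per sample; the actual DLRSE pipeline may derive the final partition differently (for example, via spectral clustering on the consensus subspace).

```python
# Minimal sketch: obtain cluster labels from a consensus representation F of shape (c, n).
from sklearn.cluster import KMeans

def cluster_from_consensus(F, n_clusters, seed=0):
    # Samples are columns of F, so cluster the transposed matrix.
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(F.T)

# e.g., c = 30 (Yale, ORL), c = 60 (Extended YaleB), c = 25 (Notting-Hill)
```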


6. Conclusion


In this paper, a deep low-rank subspace ensemble method (DLRSE) is developed for multi-view clustering. Different from existing deep multi-view clustering methods that mainly rely on a single layer of the deep model, DLRSE further improves the clustering performance by leveraging the clustering information contained in multiple layers. First, DLRSE learns multi-layer low-rank subspaces from multi-view data, so that more comprehensive and diversified clustering information can be obtained.

Figure 7: Clustering performance with different dimensions of F on each dataset (NMI versus c for Yale, Extended YaleB, ORL and Notting-Hill).

Then, to capture the intrinsic clustering structure of the data, a consensus subspace representation is obtained by subspace ensemble learning, where structured sparsity regularization is utilized to enhance the effective subspaces and suppress unreliable ones. Extensive experiments on several datasets demonstrate the effectiveness of the proposed method. It would be interesting to extend our deep model and optimization strategy to handle dynamic data and achieve online multi-view clustering, which will be explored in future work.


Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (61802028, 61532006 and 61772083), in part by the Fundamental Research Funds for the Central Universities (No. 2018RC44), in part by the Director Foundation of Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia (No. ITSM20180102), and in part by US NSF IIS-1816227.


Appendix

In this section, we present some theoretical analysis and mathematical derivations for the update rules in Section 4.

Theoretical Analysis of the Update Rule for H_i^{(v)}

We provide a brief analysis of the effectiveness of the update rule for H_i^{(v)} in (11). By using an auxiliary function method similar to that used in [21, 7], it can be proved that the objective function in (7) decreases monotonically under (11), which shows that the iteration of update rule (11) converges. Thus, we focus on proving the correctness of update rule (11), i.e., that the solution produced by (11) satisfies the KKT optimality condition. The Lagrangian function is introduced as follows,

L(H_i^{(v)}) = \mathrm{Tr}\big[ -2X^{(v)T}\Psi H_i^{(v)} + H_i^{(v)T}\Psi^{T}\Psi H_i^{(v)} + \lambda_1\big( H_i^{(v)T}H_i^{(v)} - 2H_i^{(v)T}H_i^{(v)}S_i^{(v)} + S_i^{(v)T}H_i^{(v)T}H_i^{(v)}S_i^{(v)} \big) - \beta H_i^{(v)T} \big],    (27)

where the Lagrangian multipliers β enforce the non-negativity constraint H_i^{(v)} ≥ 0. The zero-gradient condition gives

\frac{\partial L(H_i^{(v)})}{\partial H_i^{(v)}} = -2\Psi^{T}X^{(v)} + 2\Psi^{T}\Psi H_i^{(v)} + 2\lambda_1\big( H_i^{(v)} - 2H_i^{(v)}S_i^{(v)} + H_i^{(v)}S_i^{(v)}S_i^{(v)T} \big) - \beta = 0.    (28)

From the complementary slackness condition, we have

\big[ -2\Psi^{T}X^{(v)} + 2\Psi^{T}\Psi H_i^{(v)} + 2\lambda_1\big( H_i^{(v)} - 2H_i^{(v)}S_i^{(v)} + H_i^{(v)}S_i^{(v)}S_i^{(v)T} \big) \big]_{jk}\,[H_i^{(v)}]_{jk} = \beta_{jk}[H_i^{(v)}]_{jk} = 0.    (29)


This is a fixed-point equation that the solution must satisfy at convergence. It is obvious that the limiting solution of update rule (11) satisfies the fixed-point equation. At convergence, [H_i^{(v)}]^{(\infty)} = [H_i^{(v)}]^{(t+1)} = [H_i^{(v)}]^{(t)} = H_i^{(v)}, i.e.,

[H_i^{(v)}]_{jk} = [H_i^{(v)}]_{jk}\,\sqrt{\frac{[\Phi_n]_{jk}}{[\Phi_d]_{jk}}},    (30)

where

\Phi_n = [\Psi^{T}X^{(v)}]^{pos} + [\Psi^{T}\Psi H_i^{(v)}]^{neg} + 2\lambda_1[H_i^{(v)}S_i^{(v)}]^{pos} + \lambda_1[H_i^{(v)}]^{neg} + \lambda_1[H_i^{(v)}S_i^{(v)}S_i^{(v)T}]^{neg},
\Phi_d = [\Psi^{T}X^{(v)}]^{neg} + [\Psi^{T}\Psi H_i^{(v)}]^{pos} + 2\lambda_1[H_i^{(v)}S_i^{(v)}]^{neg} + \lambda_1[H_i^{(v)}]^{pos} + \lambda_1[H_i^{(v)}S_i^{(v)}S_i^{(v)T}]^{pos}.

Note that

\Psi^{T}X^{(v)} = [\Psi^{T}X^{(v)}]^{pos} - [\Psi^{T}X^{(v)}]^{neg},
\Psi^{T}\Psi H_i^{(v)} = [\Psi^{T}\Psi H_i^{(v)}]^{pos} - [\Psi^{T}\Psi H_i^{(v)}]^{neg},
H_i^{(v)} = [H_i^{(v)}]^{pos} - [H_i^{(v)}]^{neg},
H_i^{(v)}S_i^{(v)} = [H_i^{(v)}S_i^{(v)}]^{pos} - [H_i^{(v)}S_i^{(v)}]^{neg},
H_i^{(v)}S_i^{(v)}S_i^{(v)T} = [H_i^{(v)}S_i^{(v)}S_i^{(v)T}]^{pos} - [H_i^{(v)}S_i^{(v)}S_i^{(v)T}]^{neg}.    (31)

Thus, (30) reduces to

\big[ -2\Psi^{T}X^{(v)} + 2\Psi^{T}\Psi H_i^{(v)} + 2\lambda_1\big( H_i^{(v)} - 2H_i^{(v)}S_i^{(v)} + H_i^{(v)}S_i^{(v)}S_i^{(v)T} \big) \big]_{jk}\,[H_i^{(v)}]_{jk}^{2} = 0.    (32)

Equation (32) is equivalent to (29): if (29) holds, then (32) also holds, and vice versa. Therefore, the limiting solution of update rule (11) satisfies the KKT condition.

Derivations from (22) to (23)

The derivations from (22) to (23) are given as follows,

\sum_{v=1}^{V}\sum_{i=1}^{m}\alpha_i^{(v)}\,\|F - FS_i^{(v)}\|_F^2
= \sum_{v=1}^{V}\sum_{i=1}^{m}\alpha_i^{(v)}\,\mathrm{Tr}\big[(F - FS_i^{(v)})^{T}(F - FS_i^{(v)})\big]
= \sum_{v=1}^{V}\sum_{i=1}^{m}\alpha_i^{(v)}\,\mathrm{Tr}\big(F^{T}F - 2F^{T}FS_i^{(v)} + S_i^{(v)T}F^{T}FS_i^{(v)}\big)
= \sum_{v=1}^{V}\sum_{i=1}^{m}\alpha_i^{(v)}\,\mathrm{Tr}\big(F(I - 2S_i^{(v)} + S_i^{(v)}S_i^{(v)T})F^{T}\big)
= \sum_{v=1}^{V}\sum_{i=1}^{m}\alpha_i^{(v)}\,\mathrm{Tr}\big(FL_i^{(v)}F^{T}\big).    (33)

Derivations for the Update Rule for α

The derivations for update rules (25) and (26) are presented here. We use a coordinate descent method in which only two variables are updated in each iteration. There are two situations to consider: the two variables α_j^{(p)} and α_k^{(q)} belong to the same group (p = q), or they belong to different groups (p ≠ q). We give the derivation for the former situation p = q; the latter follows similar steps, so we omit it. The objective function (24) can be written as

f = \lambda_2\big(\alpha_j^{(p)}d_j^{(p)} + \alpha_k^{(q)}d_k^{(q)}\big) + \lambda_3\sqrt{(\alpha_1^{(p)})^2 + \cdots + (\alpha_j^{(p)})^2 + \cdots + (\alpha_k^{(q)})^2 + \cdots + (\alpha_m^{(p)})^2}.    (34)

According to the equality constraint 1^{T}\alpha = 1, we have \alpha_j^{(p)} + \alpha_k^{(q)} = 1 - \sum_{(i,v)\neq(j,p),(k,q)}\alpha_i^{(v)} = c, so \alpha_k^{(q)} = c - \alpha_j^{(p)}. Then the objective function becomes

f = \lambda_2\big[\alpha_j^{(p)}d_j^{(p)} + (c - \alpha_j^{(p)})d_k^{(q)}\big] + \lambda_3\sqrt{(\alpha_1^{(p)})^2 + \cdots + (\alpha_j^{(p)})^2 + \cdots + (c - \alpha_j^{(p)})^2 + \cdots + (\alpha_m^{(p)})^2}.    (35)

The optimal value of \alpha_j^{(p)} is obtained by setting

\frac{\partial f}{\partial \alpha_j^{(p)}} = \lambda_2\big(d_j^{(p)} - d_k^{(q)}\big) + \frac{\lambda_3}{\sqrt{\xi_p}}\big(2\alpha_j^{(p)} - c\big) = 0,    (36)

where \xi_p = \sum_{i=1}^{m}(\alpha_i^{(p)})^2 and c = \alpha_j^{(p)} + \alpha_k^{(q)}. So, the solution of \alpha_j^{(p)} is

\alpha_j^{(p)*} = \frac{\lambda\big(\alpha_j^{(p)} + \alpha_k^{(q)}\big) + \sqrt{\xi_p}\big(d_k^{(q)} - d_j^{(p)}\big)}{2\lambda},    (37)

where λ denotes the ratio λ_3/λ_2. By using the equality constraint, we have \alpha_k^{(q)*} = \alpha_j^{(p)} + \alpha_k^{(q)} - \alpha_j^{(p)*}.

References

[1] Andrew, G., Arora, R., Bilmes, J., Livescu, K., 2013. Deep canonical correlation analysis. In: ICML. pp. 1247–1255.
[2] Belhumeur, P. N., Hespanha, J. P., Kriegman, D. J., 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. TPAMI 19 (7), 711–720.
[3] Cai, J.-F., Candès, E. J., Shen, Z., 2010. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20 (4), 1956–1982.
[4] Cai, X., Nie, F., Huang, H., Kamangar, F., 2011. Heterogeneous image feature integration via multi-modal spectral clustering. In: CVPR. IEEE, pp. 1977–1984.
[5] Cao, X., Zhang, C., Fu, H., Liu, S., Zhang, H., 2015. Diversity-induced multi-view subspace clustering. In: CVPR. pp. 586–594.
[6] Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human detection. In: CVPR. pp. 886–893.
[7] Ding, C., Li, T., Jordan, M. I., 2010. Convex and semi-nonnegative matrix factorizations. TPAMI 32 (1), 45–55.
[8] Elhamifar, E., Vidal, R., 2013. Sparse subspace clustering: Algorithm, theory, and applications. TPAMI 35 (11), 2765–2781.
[9] Gao, H., Nie, F., Li, X., Huang, H., 2015. Multi-view subspace clustering. In: ICCV. pp. 4238–4246.
[10] Georghiades, A. S., Belhumeur, P. N., Kriegman, D. J., 2001. From few to many: illumination cone models for face recognition under variable lighting and pose. TPAMI 23 (6), 643–660.
[11] Gönen, M., Margolin, A. A., 2014. Localized data fusion for kernel k-means clustering with application to cancer biology. In: NIPS. pp. 1305–1313.
[12] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: CVPR. pp. 770–778.
[13] He, X., Kan, M.-Y., Xie, P., Chen, X., 2014. Comment-based multi-view clustering of web 2.0 items. In: WWW. pp. 771–782.
[14] Hong, C., Yu, J., You, J., Chen, X., Tao, D., 2015. Multi-view ensemble manifold regularization for 3D object recognition. Information Sciences 320, 395–405.
[15] Hu, H., Lin, Z., Feng, J., Zhou, J., 2014. Smooth representation clustering. In: CVPR. pp. 3834–3841.
[16] Huang, H.-C., Chuang, Y.-Y., Chen, C.-S., 2012. Affinity aggregation for spectral clustering. In: CVPR. pp. 773–780.
[17] Jiao, L., Shang, F., Wang, F., Liu, Y., 2012. Fast semi-supervised clustering with enhanced spectral embedding. Pattern Recognition 45 (12), 4358–4369.
[18] Kumar, A., Daumé, H., 2011. A co-training approach for multi-view spectral clustering. In: ICML. pp. 393–400.
[19] Kumar, A., Rai, P., Daumé, H., 2011. Co-regularized multi-view spectral clustering. In: NIPS. pp. 1413–1421.
[20] Lades, M., Vorbruggen, J. C., Buhmann, J., Lange, J., Von Der Malsburg, C., Wurtz, R. P., Konen, W., 1993. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers 42 (3), 300–311.
[21] Lee, D. D., Seung, H. S., 2001. Algorithms for non-negative matrix factorization. In: NIPS. pp. 556–562.
[22] Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., Ma, Y., 2013. Robust recovery of subspace structures by low-rank representation. TPAMI 35 (1), 171–184.
[23] Liu, J., Chen, Y., Zhang, J., Xu, Z., 2014. Enhancing low-rank subspace clustering by manifold regularization. TIP 23 (9), 4022–4030.
[24] Liu, J., Wang, C., Gao, J., Han, J., 2013. Multi-view clustering via joint nonnegative matrix factorization. In: SDM. Vol. 13. pp. 252–260.
[25] Lowe, D. G., 2004. Distinctive image features from scale-invariant keypoints. IJCV 60 (2), 91–110.
[26] Lyu, S., Wang, X., 2013. On algorithms for sparse multi-factor NMF. In: NIPS. pp. 602–610.
[27] Nie, F., Cai, G., Li, J., Li, X., 2018. Auto-weighted multi-view learning for image clustering and semi-supervised classification. IEEE Transactions on Image Processing 27 (3), 1501–1511.
[28] Nie, F., Cai, G., Li, X., 2017. Multi-view clustering and semi-supervised classification with adaptive neighbours. In: AAAI. pp. 2408–2414.
[29] Nie, F., Li, J., Li, X., 2016. Parameter-free auto-weighted multiple graph learning: A framework for multiview clustering and semi-supervised classification. In: IJCAI. pp. 1881–1887.
[30] Nie, F., Li, J., Li, X., 2017. Self-weighted multiview clustering with multiple graphs. In: IJCAI. pp. 2564–2570.
[31] Ojala, T., Pietikainen, M., Maenpaa, T., 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. TPAMI 24 (7), 971–987.
[32] Samaria, F., Harter, A., 1994. Parameterisation of a stochastic model for human face identification. In: Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142.
[33] Shang, R., Wang, W., Stolkin, R., Jiao, L., 2016. Subspace learning-based graph regularized feature selection. Knowledge-Based Systems 112, 152–165.
[34] Shang, R., Wang, W., Stolkin, R., Jiao, L., 2018. Non-negative spectral learning and sparse regression-based dual-graph regularized feature selection. IEEE Transactions on Cybernetics 48 (2), 793–806.
[35] Shang, R., Zhang, Z., Jiao, L., Liu, C., Li, Y., 2016. Self-representation based dual-graph regularized feature selection clustering. Neurocomputing 171, 1242–1253.
[36] Shang, R., Zhang, Z., Jiao, L., Wang, W., Yang, S., 2016. Global discriminative-based nonnegative spectral clustering. Pattern Recognition 55, 172–182.
[37] Srivastava, N., Salakhutdinov, R., 2014. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research 15, 2949–2980.
[38] Trigeorgis, G., Bousmalis, K., Zafeiriou, S., Schuller, B. W., 2014. A deep semi-NMF model for learning hidden representations. In: ICML. pp. 1692–1700.
[39] Vidal, R., 2011. Subspace clustering. SPM 28 (2), 52–68.
[40] Wang, H., Nie, F., Huang, H., 2013. Multi-view clustering and feature learning via structured sparsity. In: ICML. pp. 352–360.
[41] Wang, H., Nie, F., Huang, H., Risacher, S. L., Saykin, A. J., Shen, L., Initiative, A. D. N., 2012. Identifying disease sensitive and quantitative trait-relevant biomarkers from multidimensional heterogeneous imaging genetics data via sparse multimodal multitask learning. Bioinformatics 28 (12), i127–i136.
[42] Wang, W., Arora, R., Livescu, K., Bilmes, J., 2015. On deep multi-view representation learning. In: ICML. pp. 1083–1092.
[43] Wang, X., Guo, X., Lei, Z., Zhang, C., Li, S. Z., 2017. Exclusivity-consistency regularized multi-view subspace clustering. In: CVPR. pp. 1–9.
[44] Yang, J., Parikh, D., Batra, D., 2016. Joint unsupervised learning of deep representations and image clusters. In: CVPR. pp. 5147–5156.
[45] Zhang, C., Fu, H., Liu, S., Liu, G., Cao, X., 2015. Low-rank tensor constrained multiview subspace clustering. In: ICCV. pp. 1582–1590.
[46] Zhang, X., Gao, H., Li, G., Zhao, J., Huo, J., Yin, J., Liu, Y., Zheng, L., 2018. Multi-view clustering based on graph-regularized nonnegative matrix factorization for object recognition. Information Sciences 432, 463–478.
[47] Zhang, Y.-F., Xu, C., Lu, H., Huang, Y.-M., 2009. Character identification in feature-length films using global face-name matching. TMM 11 (7), 1276–1288.
[48] Zhao, H., Ding, Z., Fu, Y., 2017. Multi-view clustering via deep matrix factorization. In: AAAI. pp. 2921–2927.