Multi-view spectral clustering via integrating nonnegative embedding and spectral embedding


Information Fusion 55 (2020) 251–259


Zhanxuan Hu a,c, Feiping Nie a,c, Rong Wang b,c,∗, Xuelong Li a,c

a School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, Shaanxi, PR China
b School of Cybersecurity, Northwestern Polytechnical University, Xi’an 710072, Shaanxi, PR China
c Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an 710072, Shaanxi, PR China


Keywords: Clustering; Multi-view; Majorization-Minimization; Nonnegative matrix factorization

Abstract

The applicability of most existing multi-view spectral clustering methods is limited by three deficiencies. First, the requirement for post-processing, such as K-means or spectral rotation. Second, the susceptibility to parameter selection. Third, the high computation cost. To this end, in this paper we develop a novel method that integrates nonnegative embedding and spectral embedding into a unified framework. Two promising advantages of the proposed method are that 1) the learned nonnegative embedding directly reveals the consistent clustering result, so that the uncertainty introduced by post-processing can be avoided; and 2) the involved model is parameter-free, which makes our method more applicable than existing algorithms that introduce many additional parameters. Furthermore, we develop an efficient inexact Majorization-Minimization method to solve the involved model, which is non-convex and non-smooth. Experiments on multiple benchmark datasets demonstrate that our method achieves state-of-the-art performance.

1. Introduction

Clustering is a fundamental problem that arises in many fields, including data mining, computer vision, and machine learning. Conceptually, given n samples, the goal of clustering is to partition them into k subsets. In general, each sample can be described from different views, and leveraging the information of multiple views simultaneously is beneficial for achieving a better clustering result. Such a problem is referred to as multi-view clustering. Most existing approaches for dealing with this issue can be roughly divided into four groups: graph-based methods [1–6], matrix factorization methods [7–11], multiple kernel-based methods [12–14], and subspace learning-based methods [15,16]. In particular, owing to their ability to exploit non-linear structure information, graph-based methods generally outperform the others in terms of clustering precision. Different graph-based methods usually vary in how they learn the consistent spectral embedding. One simple alternative is multi-view spectral clustering, which directly uses multiple graphs, constructed by k-nearest neighbors (or other similar approaches), to learn a consistent spectral embedding [1,2,17]. However, the noise levels of different views are generally different, which implies that the corresponding graphs differ in quality. To alleviate this issue, multi-view subspace clustering [18–20] first learns a consistent similarity matrix, i.e., a graph, from multiple views, and then obtains the final spectral

tral embedding via traditional single-view spectral embedding methods, such as Ncut. There are three deficiencies that are usually encountered in graphbased methods. First, the requirement to post-processing. For graphbased methods, the final clustering result is generally returned by conducting K-means or spectral rotation to consistent spectral embedding, which inevitably introduce the uncertainty caused by initialization. Second, the susceptibility to parameter selection. The models of most existing graph-based methods introduce some additional parameters, while parameter selection is not an easy thing for clustering, an unsupervised task. Third, the high computation cost. Eigenvalue decomposition with computation complexity (𝑛3 ) is required by most multi-view spectral clustering methods, and matrix inversion with computation complexity (𝑛3 ) is required by most multi-view subspace clustering methods when solve the involved models. Both eigenvalue decomposition and matrix inversion are time consuming for large scale data. Another line of studies focus on the matrix factorization methods [7,9,21,22] that generally provide a large advantage over graph-based methods in terms of time cost. However, these methods cannot tackle with the data with non-linear structure. In short, graph-based methods perform better but are limited by the high computation cost. On the other hand, matrix factorization methods are efficient but unable to provide a satisfactory clustering result. Such an issue motivates us to combine the advantages of them. In this work, we propose to implement

Corresponding author. E-mail addresses: [email protected] (Z. Hu), [email protected] (R. Wang).

https://doi.org/10.1016/j.inffus.2019.09.005 Received 8 January 2019; Received in revised form 4 September 2019; Accepted 9 September 2019 Available online 10 September 2019 1566-2535/© 2019 Elsevier B.V. All rights reserved.


Table 1
Notation.

m : Dimensionality of an instance
n : Number of instances
k : Number of clusters
c : Number of views
X^(v) = {x_1^(v), ..., x_n^(v)} : Data matrix of view v
x_i^(v) : The ith instance of X^(v)
X_ij^(v) : The (i, j)th element of X^(v)
||X||_F = sqrt(Tr(X X^T)) : The Frobenius norm of X

In this work, we propose to implement spectral embedding and nonnegative embedding simultaneously. Our basic idea is partially motivated by Kuang et al. [23], where the relation between symmetric nonnegative matrix factorization and spectral clustering is discussed in the single-view case. The main contributions of this work can be summarized as follows:

• We provide a novel multi-view spectral clustering algorithm, namely NESE (multi-view spectral clustering via integrating Nonnegative Embedding and Spectral Embedding). It inherits the advantages of both graph-based and matrix factorization methods. Specifically, the model of NESE is parameter-free, which makes it more applicable than existing methods. Moreover, the solution returned by NESE directly reveals the consistent clustering result, so the uncertainty introduced by post-processing, such as K-means or spectral rotation, can be avoided.
• We provide an efficient optimization approach, namely inexact Majorization-Minimization (inexact-MM), to solve the non-convex and non-smooth objective involved in NESE. The computation complexity of inexact-MM is approximately O(nk^2), where n and k are the number of samples and clusters, respectively.
• We conduct numerous experiments to verify the performance of NESE, and the experimental results demonstrate that our method achieves comparable and even better clustering results. We provide the datasets and code at https://github.com/sudalvxin/SMSC.git.

We report the notations that are widely used in this paper in Table 1. The remainder of this work is organized as follows. We introduce the related works in Section 2 and present the proposed method NESE in Section 3. The optimization details w.r.t. NESE are summarized in Section 4. We report comparison results in Section 5 and conclude this work in Section 6.

2. Related work

In this section, we first review some representative methods for multi-view clustering, and then introduce the studies that are related to our method.

2.1. Multi-view clustering

Co-training Spectral Clustering (CotSC) [1] is one of the seminal works on graph-based multi-view clustering. It assumes that each instance should be partitioned into the same cluster in all views, and achieves this goal by alternately modifying the similarity matrix using the clustering results provided by other views. Another related technique is Co-regularized Spectral Clustering (CorSC) [2], where two different regularization terms are introduced to enforce the consistency among different views. To handle the quality difference among different graphs, Xia et al. [17] develop a weighted spectral embedding algorithm by assigning a specific weight to each view, and a similar idea was used in [24]. In order to avoid introducing additional weight parameters, Nie et al. [3] propose a parameter-free model, which can learn the weights of all views automatically by solving a square-root trace minimization problem. Recently, numerous algorithms based on structured graph learning have been developed [4,25–27], all of which aim to learn a consistent similarity matrix with k connected components from c graphs. Such an idea is a generalization of single-view spectral clustering [28,29]. Instead of learning a consistent similarity matrix or spectral embedding, AWP (Adaptively Weighted Procrustes) [30] aims at learning a consistent indicator matrix via spectral rotation. AWP provides a large advantage over other graph-based methods in both time cost and clustering precision.

Unlike the above methods that use multiple graphs constructed by k-nearest neighbors (or other similar approaches), multi-view subspace clustering proposes to learn a consistent representation matrix from different views. It is a generalization of single-view subspace clustering and is referred to as MVSC (Multi-View Subspace Clustering). The seminal work on MVSC is proposed in [19], in which the authors exploit the complementarity of multiple representation matrices by introducing a Hilbert-Schmidt Independence Criterion. Subsequently, a tensor-based method is proposed in [18], which solves a tensor nuclear norm minimization model to learn the consistent similarity matrix. In order to preserve the consistency among different views, Nie et al. [20] integrate representation learning and consistent spectral embedding into a unified formulation. Unlike the above methods that learn the similarity matrix in the raw space, LMVSC (Latent Multi-View Subspace Clustering) [31] seeks the consistent representation in a latent space.

Perhaps the main limitation of most existing graph-based methods is the high time cost caused by eigenvalue decomposition or matrix inversion. To tackle this issue, Li et al. [32] introduce the bipartite graph technique [33] to multi-view clustering. Despite promising results in decreasing the time cost, this algorithm disregards considerable information. Furthermore, the performance of several graph-based algorithms, including the bipartite-graph-based method, heavily depends on parameter selection and post-processing. Another representative method for fast multi-view clustering is HSIC [34], which unifies hash learning and clustering in a single framework, but its application may be limited by the additional parameters introduced in the model. Besides, an alternative for efficient multi-view clustering is the matrix factorization technique. The traditional K-means algorithm has been generalized to cope with multi-view settings [9,21], and two algorithms based on nonnegative matrix factorization have been proposed in [7,22]. Compared with graph-based methods, the algorithms based on matrix factorization are more efficient and suitable for tackling large scale data. Nevertheless, these algorithms cannot deal with data that have a non-linear structure. Recently, Wen et al. [35] provide a novel graph regularized matrix factorization method, IMC-GRMF, and use it to deal with multi-view clustering and classification tasks. Instead of introducing a graph regularization term into matrix factorization as IMC-GRMF does, in this work we propose to directly utilize the connection between graph-based methods and matrix factorization methods [23]. Specifically, we want to combine the advantages of both Spectral Clustering (SC) and symmetric nonnegative matrix factorization (SymNMF). Next, we temporarily turn our attention to these two topics.

2.2. Spectral clustering and symmetric nonnegative matrix factorization

Spectral clustering (SC): Assume that we are given a set of data points x_1, ..., x_n. In spectral clustering, the raw information is translated into a graph G = {V, E}, where V is the set of nodes and E is the set of edges. Each node of V represents a data point, and each edge of E encodes the similarity between two data points. The goal of spectral clustering is to partition V into k subsets {A_1, ..., A_k}. Suppose S is a similarity matrix, and S_ij denotes the similarity, i.e., the edge weight, between nodes i and j. There are various graph cut algorithms; in this paper we mainly focus on the Normalized cut (Ncut) [36], for it generally outperforms the others in terms of clustering precision. Note that the result provided in this paper is also applicable to other graph cut methods.

Suppose D = diag(d_1, ..., d_n) is the degree matrix and L = D - S is the Laplacian matrix, where d_i = \sum_{j=1}^{n} S_{ij} is the degree of node i. Correspondingly, the objective of the relaxed Ncut is

$$\min_{\mathbf{Z}^T \mathbf{D} \mathbf{Z} = \mathbf{I}} \; \mathrm{Tr}(\mathbf{Z}^T \mathbf{L} \mathbf{Z}), \tag{1}$$

where I denotes an identity matrix of compatible size and Z ∈ R^{n×k} denotes the relaxed variable. Let F = D^{1/2} Z. Problem (1) can be reformulated as

$$\min_{\mathbf{F}^T \mathbf{F} = \mathbf{I}} \; \mathrm{Tr}(\mathbf{F}^T \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2} \mathbf{F}). \tag{2}$$

Since L = D - S, problem (2) is equivalent to

$$\max_{\mathbf{F}^T \mathbf{F} = \mathbf{I}} \; \mathrm{Tr}(\mathbf{F}^T \mathbf{D}^{-1/2} \mathbf{S} \mathbf{D}^{-1/2} \mathbf{F}). \tag{3}$$

The optimal solution of problem (3) consists of the k eigenvectors corresponding to the top k eigenvalues of the matrix D^{-1/2} S D^{-1/2}. The matrix F can be considered as a spectral embedding of the raw data, and the final clustering result is returned by K-means or spectral rotation. Nevertheless, both K-means and spectral rotation heavily depend on the initialization.
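For concreteness, the following is a minimal NumPy sketch of this relaxed Ncut embedding: it forms D^{-1/2} S D^{-1/2} from a symmetric similarity matrix S and returns the eigenvectors associated with its k largest eigenvalues. The function name and the small numerical guard are illustrative choices, not part of [36].

```python
import numpy as np

def spectral_embedding(S, k):
    """Relaxed Ncut embedding: eigenvectors of the k largest eigenvalues of D^{-1/2} S D^{-1/2}."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))    # guard against isolated nodes
    P = S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]    # D^{-1/2} S D^{-1/2}
    P = (P + P.T) / 2.0                                  # enforce symmetry numerically
    eigvals, eigvecs = np.linalg.eigh(P)                 # eigenvalues in ascending order
    F = eigvecs[:, -k:]                                  # top-k eigenvectors, so F^T F = I
    return F
```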

Symmetric nonnegative matrix factorization (SymNMF): For a matrix M ∈ R^{n×n}, the objective of symmetric nonnegative matrix factorization is

$$\min_{\mathbf{H} \in \mathbb{R}^{n \times k}} \|\mathbf{M} - \mathbf{H}\mathbf{H}^T\|_F^2 \quad s.t.\; \mathbf{H} \ge 0, \tag{4}$$

where H ≥ 0 denotes that all elements of H are nonnegative. Besides, model (4) can be used to cope with the clustering problem by introducing an orthogonality constraint on H [37].

The connection between SymNMF and SC: The connection between SymNMF and SC has been investigated in previous studies [23,37–39]. Here, we briefly introduce the result reported in [23]. Recalling the objective of SymNMF, we can find that problem (4) is equivalent to the following formulation:

$$\min_{\mathbf{H} \in \mathbb{R}^{n \times k}} \mathrm{Tr}\!\left(\mathbf{H}\mathbf{H}^T \mathbf{H}\mathbf{H}^T\right) - 2\,\mathrm{Tr}\!\left(\mathbf{H}^T \mathbf{M} \mathbf{H}\right) \quad s.t.\; \mathbf{H} \ge 0. \tag{5}$$

If we replace the constraint H ≥ 0 and the matrix M by H^T H = I and D^{-1/2} S D^{-1/2}, respectively, problem (5) can be recast as

$$\max_{\mathbf{H} \in \mathbb{R}^{n \times k}} \mathrm{Tr}(\mathbf{H}^T \mathbf{D}^{-1/2} \mathbf{S} \mathbf{D}^{-1/2} \mathbf{H}) \quad s.t.\; \mathbf{H}^T \mathbf{H} = \mathbf{I}. \tag{6}$$

Obviously, Eq. (6) is consistent with the relaxed objective of Ncut, i.e., problem (3). That is, both SC and SymNMF can be seen as relaxations of min_{H ∈ C} ||M - H H^T||_F^2, where H ∈ C denotes a constraint on H. Inspired by this observation, Kuang et al. [23] use objective (4) to solve the spectral clustering problem, where the matrix M is replaced by the normalized similarity matrix D^{-1/2} S D^{-1/2}. However, this model cannot deal with the multi-view clustering task.
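As a side note, the SymNMF objective (4) can be attacked with a simple projected gradient scheme; the sketch below is illustrative only (fixed step size, untuned iteration count, and a nonnegative similarity matrix assumed as input) and is not the solver used in [23] or in this paper.

```python
import numpy as np

def symnmf(M, k, n_iter=300, step=1e-2, seed=0):
    """Minimize ||M - H H^T||_F^2 s.t. H >= 0 by projected gradient descent (illustrative)."""
    rng = np.random.default_rng(seed)
    H = 0.1 * rng.random((M.shape[0], k))        # small nonnegative initialization
    for _ in range(n_iter):
        grad = 4.0 * (H @ (H.T @ H) - M @ H)     # gradient of the objective for symmetric M
        H = np.maximum(H - step * grad, 0.0)     # gradient step, then project onto H >= 0
    return H

# labels can then be read off as H.argmax(axis=1), as in [23]
```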

3. Proposed method

In this section, we first provide a model for single-view spectral clustering, and then generalize it to the multi-view setting.

3.1. Single-view spectral clustering via integrating nonnegative embedding and spectral embedding

Since the solution returned by the relaxed Ncut is continuous, one has to perform a post-processing step to obtain the final clustering result. In contrast, the solution of SymNMF directly reveals the cluster structure. Hence, we try to combine the objectives of both SymNMF and Ncut. Specifically, the new objective for single-view spectral clustering is

$$\min_{\mathbf{F}, \mathbf{H}} \|\mathbf{P} - \mathbf{H}\mathbf{F}^T\|_F^2 \quad s.t.\; \mathbf{H} \ge 0,\; \mathbf{F}^T \mathbf{F} = \mathbf{I}, \tag{7}$$

where P = D^{-1/2} S D^{-1/2}, F ∈ R^{n×k} denotes the spectral embedding, and H ∈ R^{n×k} denotes the nonnegative embedding. Following [23], we assign instance i to the class corresponding to the index of the largest entry in the ith row of H.

3.2. Connection with previous studies

A number of algorithms based on nonnegative embedding have been developed for single-view clustering. In [23], Eq. (5) is used to solve the single-view spectral clustering problem. A similar model has been proposed in [40], where Eq. (5) is recast as

$$\min_{c \in \mathbb{R}^{+},\, \mathbf{H} \in \mathbb{R}^{n \times k}} \|c\mathbf{M} - \mathbf{H}\mathbf{H}^T\|_F^2 \quad s.t.\; \mathbf{H} \ge 0,\; \mathbf{H}\mathbf{1} = \mathbf{1}, \tag{8}$$

where c is a scaling factor. Note that this model requires the input M to be a similarity matrix. The solution H* is a cluster probability matrix, reflecting the final clustering result. Furthermore, Yang et al. [41] propose to enforce the nonnegativity and orthogonality constraints simultaneously, and reformulate Eq. (5) as

$$\min_{\mathbf{H} \in \mathbb{R}^{n \times k}} \|\mathbf{M} - \mathbf{H}\mathbf{H}^T\|_F^2 + \lambda \mathcal{R}(\mathbf{H}) \quad s.t.\; \mathbf{H}^T\mathbf{H} = \mathbf{I},\; \mathbf{H} \ge 0, \tag{9}$$

where M is a refined similarity matrix and R(H) is a regularization term w.r.t. H, whose interpretation can be found in [41]. The main difference between the models mentioned above and the objective (7) proposed in this paper is that a new variable F is introduced. Actually, objective (7) can be seen as a relaxed formulation that increases the flexibility in the multi-view setting.

Two works that are most relevant to the proposed method are [42] and [38]. Specifically, in [42], the authors propose to learn an Orthogonal and Nonnegative Embedding (ONE) by solving the following problem:

$$\min_{\mathbf{F}, \mathbf{H}} \|\mathbf{P} - \mathbf{H}\mathbf{F}^T\|_F^2 + \lambda \|\mathbf{H} - \mathbf{F}\|_F^2 \quad s.t.\; \mathbf{H} \ge 0,\; \mathbf{F}^T \mathbf{F} = \mathbf{I}. \tag{10}$$

If we set the parameter λ to zero, the above objective is equivalent to Eq. (7). In particular, the experimental results reported in [42] demonstrate that ONE generally achieves a better result when λ takes a small value; hence, setting λ = 0 in this paper is reasonable. Besides, in [38] an orthogonal rotation matrix Q of size k × k is introduced between H and F. However, FQ still has orthonormal columns when F does and Q is an orthogonal matrix, so these two objectives are actually equivalent. Next, we generalize Eq. (7) to cope with the multi-view setting.

3.3. Multi-view spectral clustering via integrating nonnegative embedding and spectral embedding

Suppose we are given c feature matrices {X^(v)}_{v=1}^{c} describing the input data from c views. Naturally, we can construct c similarity matrices {P^(v)}_{v=1}^{c}. As the goal of multi-view clustering is to find a consistent clustering result for all views, we build the following model:

$$\min_{\{\mathbf{F}^{(v)}\}_{v=1}^{c},\, \mathbf{H}} \; \sum_{v=1}^{c} \|\mathbf{P}^{(v)} - \mathbf{H}(\mathbf{F}^{(v)})^T\|_F^2 \quad s.t.\; \mathbf{H} \ge 0,\; (\mathbf{F}^{(v)})^T \mathbf{F}^{(v)} = \mathbf{I}, \tag{11}$$

where F^(v) denotes the spectral embedding of view v, and H denotes the consistent nonnegative embedding, which reveals the final consistent clustering result. Considering the capacity difference between views, we learn c different spectral embeddings rather than one spectral embedding shared by all views, which improves the robustness of the algorithm. Furthermore, following [30], we want to assign a specific weight to each view, and we recast Eq. (11) as

$$\min_{\{\mathbf{F}^{(v)}\}_{v=1}^{c},\, \mathbf{H}} \; \sum_{v=1}^{c} \|\mathbf{P}^{(v)} - \mathbf{H}(\mathbf{F}^{(v)})^T\|_F \quad s.t.\; \mathbf{H} \ge 0,\; (\mathbf{F}^{(v)})^T \mathbf{F}^{(v)} = \mathbf{I}. \tag{12}$$

In the next section, we show that the weight of each view is related to its approximation error and can be learned adaptively. Eq. (12) is the final objective of the proposed method. Obviously, it is parameter-free, and the clustering result is directly revealed by the nonnegative embedding H, so the uncertainty introduced by K-means or spectral rotation can be avoided.

4. Optimization of proposed method

In this section, we focus on solving the objective of NESE. Note that directly solving problem (12) is challenging, for it is non-smooth and non-convex. Following [30], we adopt an inexact Majorization-Minimization (MM) method [43]. Before continuing, we provide a brief introduction to MM, which has been omitted by previous studies [4,25,30].

4.1. Majorization-Minimization

Consider a general optimization problem formulated as

$$\min_{x} \; g(x) \quad s.t.\; x \in \mathcal{C}, \tag{13}$$

where C denotes the constraint set of x. A default assumption is that g(x) → ∞ when x ∈ C and ||x|| → ∞. The MM method is designed for the setting where g(x) is not easy to minimize directly, for instance when g(x) is non-smooth or non-convex. Suppose x_0 is an initialization of x. The core of MM is to construct a surrogate function f(x, x_t) that is easier to tackle than problem (13), where x_t is the solution returned by minimizing f(x, x_{t-1}). That is,

$$x_t = \arg\min_{x} \; f(x, x_{t-1}) \quad s.t.\; x \in \mathcal{C}. \tag{14}$$

In particular, the surrogate functions {f(x, x_t)} of g(x) should satisfy the following two conditions:

• For any x ∈ C, we have g(x) ≤ f(x, x_t);
• For any solution x_t returned by minimizing f(x, x_{t-1}), we have f(x_t, x_t) = g(x_t).

These two conditions imply

$$g(x_{t+1}) \le f(x_{t+1}, x_t) \le f(x_t, x_t) = g(x_t). \tag{15}$$

Eq. (15) shows that the objective value of g(x) is non-increasing. Note, however, that beyond this result MM does not provide any further theoretical guarantee, e.g., a convergent solution sequence {x_t}.
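To make the two conditions concrete, here is a small worked surrogate. This illustration is ours; it uses only the elementary inequality stated as Lemma 1 in Section 4.4, namely a ≤ a²/(2b) + b/2 for positive a and b, with equality at a = b.

```latex
% Majorizing the non-smooth g(x) = |x| around the current iterate x_t (x_t != 0):
% take a = |x| and b = |x_t| in  a <= a^2/(2b) + b/2.
\[
f(x, x_t) \;=\; \frac{x^2}{2\,|x_t|} \;+\; \frac{|x_t|}{2},
\qquad
g(x) = |x| \;\le\; f(x, x_t) \ \ \forall x,
\qquad
f(x_t, x_t) = |x_t| = g(x_t).
\]
% Both MM conditions hold, and f(., x_t) is a smooth quadratic that is easy to minimize.
% Applying the same bound to each unsquared Frobenius norm in (12), with
% a = ||P^(v) - H (F^(v))^T||_F and b = ||P^(v) - H_t (F_t^(v))^T||_F,
% is what produces the weighted surrogate (16) with the weights (17) below.
```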

4.2. Solve the objective of NESE via inexact-MM

The MM method is not directly suitable for solving problem (12), because the problem has multiple variables and we cannot find a surrogate function that satisfies the first condition. Hence, we relax the first condition to g(x_{t+1}) ≤ f(x_{t+1}, x_t) (here, x can represent multiple variables), such that Eq. (15) still holds. We term the resulting method inexact-MM.

Accordingly, the key idea of using inexact-MM to solve the objective of NESE is to find an easy-to-tackle surrogate function for problem (12). Specifically, suppose H_t and {F_t^(v)} are the solutions at the tth iteration; the surrogate function at the (t+1)th iteration is

$$\min_{\{\mathbf{F}^{(v)}\}_{v=1}^{c},\, \mathbf{H}} \; \sum_{v=1}^{c} \delta_t^{(v)} \|\mathbf{P}^{(v)} - \mathbf{H}(\mathbf{F}^{(v)})^T\|_F^2 \quad s.t.\; \mathbf{H} \ge 0,\; (\mathbf{F}^{(v)})^T \mathbf{F}^{(v)} = \mathbf{I}, \tag{16}$$

where

$$\delta_t^{(v)} = \frac{1}{2\,\|\mathbf{P}^{(v)} - \mathbf{H}_t(\mathbf{F}_t^{(v)})^T\|_F}. \tag{17}$$

It is obvious that δ_t^(v) can be considered as the weight of view v, and its value depends on the reconstruction error of view v, so the weights of all views are assigned automatically. In Section 4.4, we show that this surrogate function satisfies the two conditions of inexact-MM. Next, we focus on solving problem (16). Since problem (16) has multiple variables, finding the optimal solution directly is difficult. We therefore use an alternating minimization method which updates one variable while fixing the others.

Update the matrix H: Fixing {F^(v)}_{v=1}^{c} and removing irrelevant terms, we update H by solving the following problem:

$$\min_{\mathbf{H}} \; \sum_{v=1}^{c} \delta_t^{(v)} \|\mathbf{P}^{(v)} - \mathbf{H}(\mathbf{F}_t^{(v)})^T\|_F^2 \quad s.t.\; \mathbf{H} \ge 0. \tag{18}$$

Since (F_t^(v))^T F_t^(v) = I for every v ∈ {1, ..., c}, the above problem is equivalent to

$$\min_{\mathbf{H}} \; \sum_{v=1}^{c} \delta_t^{(v)} \|\mathbf{P}^{(v)}\mathbf{F}_t^{(v)} - \mathbf{H}\|_F^2 \quad s.t.\; \mathbf{H} \ge 0. \tag{19}$$

Further, by introducing and removing some constant terms, we can reformulate problem (19) as

$$\min_{\mathbf{H}} \; \Big\|\sum_{v=1}^{c} \hat{\delta}_t^{(v)} \mathbf{P}^{(v)}\mathbf{F}_t^{(v)} - \mathbf{H}\Big\|_F^2 \quad s.t.\; \mathbf{H} \ge 0, \tag{20}$$

where

$$\hat{\delta}_t^{(v)} = \frac{\delta_t^{(v)}}{\sum_{v=1}^{c} \delta_t^{(v)}}. \tag{21}$$

Let T_H = \sum_{v=1}^{c} \hat{\delta}_t^{(v)} P^{(v)} F_t^{(v)}. The optimal solution of problem (20) is

$$\mathbf{H}_{t+1} = \max(\mathbf{0}, \mathbf{T}_H). \tag{22}$$

Update the matrices {F^(v)}_{v=1}^{c}: Fixing the matrix H as H_{t+1}, we update each F^(v) separately by solving the following problem:

$$\min_{\mathbf{F}^{(v)}} \; \|\mathbf{P}^{(v)} - \mathbf{H}_{t+1}(\mathbf{F}^{(v)})^T\|_F^2 \quad s.t.\; (\mathbf{F}^{(v)})^T \mathbf{F}^{(v)} = \mathbf{I}. \tag{23}$$

This is the well-known orthogonal Procrustes problem, and it has the closed-form solution

$$\mathbf{F}_{t+1}^{(v)} = \mathbf{U}\mathbf{V}^T, \tag{24}$$

where U S V^T is the SVD of (P^(v))^T H_{t+1}.
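A minimal NumPy sketch of this Procrustes step is given below; the thin SVD suffices because only the k leading singular vectors are needed, and the function name is ours.

```python
import numpy as np

def procrustes_update(P_v, H):
    """Closed-form solution of min_F ||P_v - H F^T||_F^2  s.t.  F^T F = I  (Eqs. (23)-(24))."""
    A = P_v.T @ H                                       # (P^(v))^T H_{t+1}, an n x k matrix
    U, _, Vt = np.linalg.svd(A, full_matrices=False)    # thin SVD: A = U S V^T
    return U @ Vt                                       # F^(v)_{t+1} = U V^T
```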

The entire procedure is outlined in Algorithm 1.


Algorithm 1. Solve NESE via inexact-MM.
Require: The matrices {P^(v)}_{v=1}^{c}.
Ensure: The consistent nonnegative embedding H and the spectral embedding matrices {F^(v)}_{v=1}^{c}.
Initialization: Initialize F^(v) by the spectral embedding of view v.
while t ≤ t_max do
  Step 1. Update δ_t^(v) by Eq. (17);
  Step 2. Update H by Eq. (22);
  Step 3. Update each F^(v) by Eq. (24);
  Step 4. Convergence checking.
end while
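The authors provide MATLAB code at the repository cited above; purely as an illustration, the following NumPy sketch mirrors Algorithm 1 as described here. The initialization of H (a projected average of P^(v) F^(v)), the numerical guards, and the function name are our assumptions, not specified in the paper.

```python
import numpy as np

def nese(P_list, k, t_max=100, tol=1e-8):
    """Illustrative NumPy version of Algorithm 1 (inexact-MM for NESE).

    P_list : list of per-view normalized similarity matrices P^(v), each n x n.
    k      : number of clusters.
    """
    # Initialization: F^(v) is the spectral embedding of view v
    # (eigenvectors of the k largest eigenvalues of P^(v)).
    F_list = [np.linalg.eigh((P + P.T) / 2)[1][:, -k:] for P in P_list]
    # H is not specified by the paper's initialization step; we start from the
    # projected average of P^(v) F^(v) as a reasonable nonnegative guess.
    H = np.maximum(sum(P @ F for P, F in zip(P_list, F_list)) / len(P_list), 0.0)

    prev_obj = np.inf
    for _ in range(t_max):
        # Step 1: view weights, Eq. (17).
        res = [np.linalg.norm(P - H @ F.T) for P, F in zip(P_list, F_list)]
        delta = np.array([1.0 / (2.0 * max(r, 1e-12)) for r in res])
        # Step 2: update H, Eqs. (20)-(22).
        w = delta / delta.sum()
        T_H = sum(w_v * (P @ F) for w_v, P, F in zip(w, P_list, F_list))
        H = np.maximum(T_H, 0.0)
        # Step 3: update each F^(v), Eqs. (23)-(24) (orthogonal Procrustes).
        new_F = []
        for P in P_list:
            U, _, Vt = np.linalg.svd(P.T @ H, full_matrices=False)
            new_F.append(U @ Vt)
        F_list = new_F
        # Step 4: convergence check on the objective of Eq. (12).
        obj = sum(np.linalg.norm(P - H @ F.T) for P, F in zip(P_list, F_list))
        if np.isfinite(prev_obj) and (prev_obj - obj) / prev_obj <= tol:
            break
        prev_obj = obj

    labels = H.argmax(axis=1)   # row-wise argmax of the nonnegative embedding
    return H, F_list, labels
```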

4.3. Computation complexity analysis

Apart from Step 3, the main computation cost of Algorithm 1 consists of matrix multiplications and can be ignored. In Step 3, the cost of computing the SVD of the matrices {(P^(v))^T H}_{v=1}^{c} is O(nk^2), which is far less than the O(n^3) required by previous graph-based algorithms [4,25–27]. In addition, we show in Section 5.4 that the convergence speed of our method is generally fast: it converges within 100 iterations under the convergence criterion (C_{t-1} - C_t)/C_{t-1} ≤ 1 × 10^{-8}, where C_t denotes the objective function value at the tth iteration.

4.4. Convergence analysis

In this section, we provide the convergence analysis of Algorithm 1. We start by introducing a useful lemma.

Lemma 1. Given two positive constants a and b, the following inequality holds:

$$a - \frac{a^2}{2b} \le b - \frac{b^2}{2b}. \tag{25}$$

Proof. Given two positive constants a and b, we have

$$(a-b)^2 \ge 0 \;\Rightarrow\; a^2 - 2ab + b^2 \ge 0 \;\Rightarrow\; 2ab - a^2 \le 2b^2 - b^2 \;\Rightarrow\; a - \frac{a^2}{2b} \le b - \frac{b^2}{2b}. \tag{26}$$

□

The theoretical result w.r.t. Algorithm 1 is as follows.

Theorem 1. Algorithm 1 decreases the objective function value of problem (12) iteratively. That is,

$$\sum_{v=1}^{c} \|\mathbf{P}^{(v)} - \mathbf{H}_{t+1}(\mathbf{F}_{t+1}^{(v)})^T\|_F \le \sum_{v=1}^{c} \|\mathbf{P}^{(v)} - \mathbf{H}_{t}(\mathbf{F}_{t}^{(v)})^T\|_F, \tag{27}$$

where H_t and {F_t^(v)}_{v=1}^{c} are the solutions returned by Algorithm 1 at the tth iteration.

Proof. According to the definitions of H_{t+1}, {F_{t+1}^(v)} and {δ_t^(v)}, we know that

$$\sum_{v=1}^{c} \frac{\|\mathbf{P}^{(v)} - \mathbf{H}_{t+1}(\mathbf{F}_{t+1}^{(v)})^T\|_F^2}{2\,\|\mathbf{P}^{(v)} - \mathbf{H}_{t}(\mathbf{F}_{t}^{(v)})^T\|_F} \;\le\; \sum_{v=1}^{c} \frac{\|\mathbf{P}^{(v)} - \mathbf{H}_{t}(\mathbf{F}_{t}^{(v)})^T\|_F^2}{2\,\|\mathbf{P}^{(v)} - \mathbf{H}_{t}(\mathbf{F}_{t}^{(v)})^T\|_F}. \tag{28}$$

Lemma 1 implies that

$$\sum_{v=1}^{c} \|\mathbf{P}^{(v)} - \mathbf{H}_{t+1}(\mathbf{F}_{t+1}^{(v)})^T\|_F - \sum_{v=1}^{c} \frac{\|\mathbf{P}^{(v)} - \mathbf{H}_{t+1}(\mathbf{F}_{t+1}^{(v)})^T\|_F^2}{2\,\|\mathbf{P}^{(v)} - \mathbf{H}_{t}(\mathbf{F}_{t}^{(v)})^T\|_F} \;\le\; \sum_{v=1}^{c} \|\mathbf{P}^{(v)} - \mathbf{H}_{t}(\mathbf{F}_{t}^{(v)})^T\|_F - \sum_{v=1}^{c} \frac{\|\mathbf{P}^{(v)} - \mathbf{H}_{t}(\mathbf{F}_{t}^{(v)})^T\|_F^2}{2\,\|\mathbf{P}^{(v)} - \mathbf{H}_{t}(\mathbf{F}_{t}^{(v)})^T\|_F}. \tag{29}$$

Combining Eqs. (28) and (29), we obtain

$$\sum_{v=1}^{c} \|\mathbf{P}^{(v)} - \mathbf{H}_{t+1}(\mathbf{F}_{t+1}^{(v)})^T\|_F \le \sum_{v=1}^{c} \|\mathbf{P}^{(v)} - \mathbf{H}_{t}(\mathbf{F}_{t}^{(v)})^T\|_F, \tag{30}$$

which completes the proof. In particular, Eqs. (28) and (16) correspond to the two conditions of inexact-MM, respectively. □

5. Experiments

In order to verify the performance of the proposed method NESE, we compare it with a number of graph-based multi-view clustering methods, including CotSC [1], CorSC [2], MLAN [4], SwMC [25], AASC [24], MVGL [27], AMGL [3] and AWP [30]. For all algorithms that require graph similarity matrices as input, we use the method proposed in [29] to construct the similarity matrix for each view, because it avoids the scale difference between views and generates a normalized similarity matrix for each view. In addition, we report the best result returned by a single view. We conduct all experiments on a Lenovo 5060 PC with 4 Intel i5-4590 processors and 16 GB RAM; the experimental environment is Matlab 2016b. The MATLAB codes of all compared methods are provided by their authors, and the involved parameters are adjusted by following the authors' suggestions. We provide the datasets and the code at https://github.com/sudalvxin/SMSC.git.
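The parameter-free construction of [29] is not reproduced here; as a rough, generic stand-in for turning one view's feature matrix into a normalized similarity matrix P^(v), a Gaussian-weighted k-nearest-neighbor graph with symmetric normalization could look like the sketch below (the bandwidth heuristic, neighbor count, and names are our own choices).

```python
import numpy as np

def knn_similarity(X, n_neighbors=10):
    """Gaussian-weighted k-NN similarity with symmetric normalization (generic stand-in, not [29])."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1, keepdims=True)
    dist2 = np.maximum(sq + sq.T - 2.0 * X @ X.T, 0.0)     # pairwise squared Euclidean distances
    S = np.exp(-dist2 / (dist2.mean() + 1e-12))            # global-bandwidth Gaussian weights
    idx = np.argsort(-S, axis=1)[:, 1:n_neighbors + 1]     # k nearest neighbors (excluding self)
    mask = np.zeros_like(S, dtype=bool)
    mask[np.repeat(np.arange(n), n_neighbors), idx.ravel()] = True
    S = np.where(mask | mask.T, S, 0.0)                    # keep neighbor edges, symmetrize pattern
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(S.sum(axis=1), 1e-12))
    return S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # P^(v) = D^{-1/2} S D^{-1/2}
```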

5.1. Datasets

Four real datasets are used to verify the performance of NESE. Their details are summarized as follows.

• COIL20 [44]: COIL20 is an image dataset which has 1440 images consisting of 20 groups, and each group has 72 data points. For each image, we extract three different feature vectors: a 1024-D intensity feature, a 3304-D LBP feature, and a 6750-D Gabor feature.
• NUS [45]: NUS has 2400 images extracted from NUS-WIDE. All images are partitioned into 12 groups. For each image, we extract six different feature vectors, including a 64-D color histogram, a 144-D colour moment, a 73-D edge direction histogram, a 128-D wavelet texture, and a 255-D SIFT description.
• ORL [46]: ORL is a face dataset which has 400 images consisting of 40 groups. For each image, we extract four different feature vectors: a 512-D GIST feature, a 59-D LBP feature, an 864-D HOG feature, and a 254-D CENTRIST feature.
• Outdoor Scene [47]: The dataset has 2688 images consisting of 8 groups. For each image, we extract four different feature vectors, including a 512-D GIST feature, a 432-D color moment, a 256-D HOG feature, and a 48-D LBP feature.

5.2. Evaluation metrics

For all algorithms, the clustering result is measured by six evaluation metrics: ACC (Clustering Accuracy), NMI (Normalized Mutual Information), Purity, Precision, F-score, and ARI (Adjusted Rand Index). All of them have been widely used in previous studies [18,19,26,27], and their definitions can be found in [26]. For each evaluation metric, a larger value indicates a better result.
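For reference, these metrics can be computed as in the following sketch: ACC uses the usual best one-to-one matching between predicted and true labels via the Hungarian algorithm, while NMI and ARI come from scikit-learn. The helper name is ours, and the exact implementations used in [26] may differ in detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one label matching (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    agree = np.zeros((classes.size, classes.size), dtype=int)
    for i, ci in enumerate(classes):
        for j, cj in enumerate(classes):
            agree[i, j] = np.sum((y_pred == ci) & (y_true == cj))
    row, col = linear_sum_assignment(-agree)      # maximize the total agreement
    return agree[row, col].sum() / y_true.size

# acc = clustering_accuracy(y_true, labels)
# nmi = normalized_mutual_info_score(y_true, labels)
# ari = adjusted_rand_score(y_true, labels)
```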

5.3. Performance evaluation

For each test dataset, we run all algorithms 30 times and report the mean values and standard errors. The clustering results returned by all algorithms are reported in Table 2, where SC-Best denotes the best result achieved by a single view. Table 2 shows that NESE performs better than the others on all datasets except COIL20. Although SwMC outperforms NESE on COIL20, the accuracy gap between them is very small, and the superiority of NESE on the remaining datasets is dominant. In addition, we show in the next subsection that the time cost of NESE is far less than that of SwMC. The performance of AWP is worse than that of our method NESE, because it adopts a two-step mechanism to learn the consistent clustering result: AWP first learns the spectral embeddings of different views separately, and then learns a consistent clustering result via spectral rotation. The first step may generate sub-optimal spectral embeddings and further degrade the final clustering result, whereas NESE integrates nonnegative embedding and spectral embedding into a unified framework and learns the consistent clustering directly. CorSC and CotSC are two methods that completely ignore the quality difference between views, so their performance is worse than the best single-view clustering on some datasets. Both AMGL and AASC assign a specific weight to each view, but this is not enough to handle the quality difference among the similarity matrices, because using FF^T to simultaneously approximate multiple similarity matrices may yield an inferior spectral embedding. In contrast, NESE allows different views to have different spectral embeddings rather than share a common embedding, and therefore achieves a better clustering result. The uncertainty brought by post-processing is another factor leading to the gap between AMGL (AASC) and NESE. In addition, we fix the initialization variables of both AWP and NESE, hence their standard errors are 0. For MVGL and SwMC, the clustering result is reflected by a block diagonal matrix, so their standard errors are also 0. For the remaining algorithms, including AMGL, CorSC, and CotSC, the clustering results are obtained by conducting K-means on the consistent spectral embedding, so their performance heavily depends on the initialization of K-means (the standard error was reported as 0 when its value is less than 10^{-4}). In contrast, the performance of NESE is stable, because all of its inputs are fixed.

Table 2
The experimental results on real datasets. We highlight the best results in bold.

Data      | Algorithm | ACC         | NMI         | Purity      | F-score     | Precision   | ARI
COIL20    | SC-Best   | 0.73(±0.01) | 0.82(±0.01) | 0.75(±0.01) | 0.69(±0.01) | 0.67(±0.00) | 0.68(±0.02)
          | AWP       | 0.68(±0.00) | 0.87(±0.00) | 0.75(±0.00) | 0.73(±0.00) | 0.60(±0.00) | 0.71(±0.00)
          | MLAN      | 0.84(±0.00) | 0.92(±0.00) | 0.88(±0.00) | 0.82(±0.00) | 0.73(±0.00) | 0.81(±0.00)
          | SwMC      | 0.86(±0.00) | 0.94(±0.00) | 0.90(±0.00) | 0.84(±0.00) | 0.76(±0.00) | 0.84(±0.00)
          | AMGL      | 0.80(±0.04) | 0.91(±0.02) | 0.85(±0.03) | 0.76(±0.07) | 0.64(±0.09) | 0.74(±0.07)
          | AASC      | 0.79(±0.00) | 0.89(±0.00) | 0.83(±0.00) | 0.77(±0.00) | 0.73(±0.00) | 0.76(±0.00)
          | MVGL      | 0.78(±0.00) | 0.88(±0.00) | 0.81(±0.00) | 0.76(±0.00) | 0.70(±0.00) | 0.75(±0.00)
          | CorSC     | 0.68(±0.04) | 0.78(±0.02) | 0.70(±0.03) | 0.64(±0.03) | 0.62(±0.04) | 0.62(±0.03)
          | CotSC     | 0.70(±0.03) | 0.80(±0.02) | 0.72(±0.03) | 0.67(±0.03) | 0.64(±0.04) | 0.65(±0.03)
          | NESE      | 0.85(±0.00) | 0.92(±0.00) | 0.88(±0.00) | 0.83(±0.00) | 0.80(±0.00) | 0.82(±0.00)
NUS       | SC-Best   | 0.21(±0.01) | 0.09(±0.01) | 0.21(±0.01) | 0.15(±0.01) | 0.13(±0.00) | 0.07(±0.02)
          | AWP       | 0.28(±0.00) | 0.15(±0.00) | 0.29(±0.00) | 0.17(±0.00) | 0.15(±0.00) | 0.09(±0.00)
          | MLAN      | 0.25(±0.00) | 0.15(±0.00) | 0.26(±0.00) | 0.17(±0.00) | 0.10(±0.00) | 0.04(±0.00)
          | SwMC      | 0.15(±0.00) | 0.08(±0.00) | 0.17(±0.00) | 0.16(±0.00) | 0.09(±0.00) | 0.01(±0.00)
          | AMGL      | 0.25(±0.01) | 0.13(±0.01) | 0.27(±0.01) | 0.17(±0.01) | 0.13(±0.01) | 0.07(±0.01)
          | AASC      | 0.25(±0.00) | 0.13(±0.00) | 0.27(±0.00) | 0.15(±0.00) | 0.13(±0.00) | 0.06(±0.00)
          | MVGL      | 0.15(±0.00) | 0.07(±0.00) | 0.16(±0.00) | 0.16(±0.00) | 0.09(±0.00) | 0.01(±0.00)
          | CorSC     | 0.27(±0.01) | 0.14(±0.01) | 0.29(±0.01) | 0.16(±0.01) | 0.16(±0.01) | 0.09(±0.01)
          | CotSC     | 0.29(±0.01) | 0.16(±0.01) | 0.30(±0.01) | 0.17(±0.01) | 0.16(±0.01) | 0.09(±0.01)
          | NESE      | 0.32(±0.00) | 0.17(±0.00) | 0.34(±0.00) | 0.19(±0.00) | 0.17(±0.00) | 0.11(±0.00)
ORL       | SC-Best   | 0.66(±0.02) | 0.76(±0.02) | 0.71(±0.02) | 0.68(±0.02) | 0.57(±0.01) | 0.67(±0.01)
          | AWP       | 0.80(±0.00) | 0.91(±0.00) | 0.83(±0.00) | 0.77(±0.00) | 0.69(±0.00) | 0.76(±0.00)
          | MLAN      | 0.78(±0.00) | 0.88(±0.00) | 0.82(±0.00) | 0.68(±0.00) | 0.60(±0.00) | 0.67(±0.00)
          | SwMC      | 0.77(±0.00) | 0.90(±0.00) | 0.83(±0.00) | 0.63(±0.00) | 0.49(±0.00) | 0.62(±0.00)
          | AMGL      | 0.75(±0.02) | 0.90(±0.02) | 0.82(±0.02) | 0.64(±0.08) | 0.52(±0.10) | 0.63(±0.09)
          | AASC      | 0.82(±0.02) | 0.91(±0.01) | 0.85(±0.01) | 0.77(±0.02) | 0.73(±0.03) | 0.76(±0.02)
          | MVGL      | 0.75(±0.00) | 0.88(±0.00) | 0.80(±0.00) | 0.57(±0.00) | 0.43(±0.00) | 0.55(±0.00)
          | CorSC     | 0.77(±0.03) | 0.90(±0.01) | 0.82(±0.03) | 0.73(±0.03) | 0.67(±0.04) | 0.72(±0.04)
          | CotSC     | 0.75(±0.04) | 0.87(±0.01) | 0.78(±0.03) | 0.68(±0.03) | 0.63(±0.04) | 0.67(±0.03)
          | NESE      | 0.84(±0.00) | 0.92(±0.00) | 0.87(±0.00) | 0.78(±0.00) | 0.74(±0.00) | 0.78(±0.00)
Out-Scene | SC-Best   | 0.47(±0.01) | 0.39(±0.01) | 0.57(±0.01) | 0.37(±0.01) | 0.36(±0.01) | 0.34(±0.01)
          | AWP       | 0.65(±0.00) | 0.51(±0.00) | 0.65(±0.00) | 0.49(±0.00) | 0.48(±0.00) | 0.42(±0.00)
          | MLAN      | 0.55(±0.02) | 0.47(±0.01) | 0.55(±0.02) | 0.45(±0.02) | 0.33(±0.03) | 0.33(±0.03)
          | SwMC      | 0.50(±0.00) | 0.47(±0.00) | 0.50(±0.00) | 0.49(±0.00) | 0.36(±0.00) | 0.38(±0.00)
          | AMGL      | 0.51(±0.05) | 0.45(±0.03) | 0.52(±0.04) | 0.45(±0.04) | 0.34(±0.04) | 0.34(±0.05)
          | AASC      | 0.60(±0.00) | 0.48(±0.00) | 0.60(±0.00) | 0.44(±0.00) | 0.39(±0.00) | 0.35(±0.00)
          | MVGL      | 0.42(±0.00) | 0.31(±0.00) | 0.43(±0.00) | 0.32(±0.00) | 0.21(±0.00) | 0.16(±0.00)
          | CorSC     | 0.51(±0.04) | 0.39(±0.03) | 0.52(±0.03) | 0.40(±0.01) | 0.39(±0.02) | 0.31(±0.02)
          | CotSC     | 0.38(±0.02) | 0.22(±0.01) | 0.39(±0.02) | 0.26(±0.02) | 0.26(±0.01) | 0.16(±0.01)
          | NESE      | 0.72(±0.00) | 0.55(±0.00) | 0.72(±0.00) | 0.55(±0.00) | 0.53(±0.00) | 0.48(±0.00)

Table 3
The time cost (s) of all algorithms.

Algorithm | COIL20 | NUS     | ORL   | Out-Scene
AWP       | 4.13   | 17.23   | 0.50  | 14.70
MLAN      | 14.98  | 53.39   | 1.68  | 56.84
SwMC      | 707.40 | 2415.83 | 4.79  | 3611.09
AMGL      | 5.50   | 35.76   | 0.41  | 24.73
AASC      | 13.81  | 23.22   | 4.49  | 12.82
MVGL      | 150.82 | 801.97  | 9.72  | 677.42
CorSC     | 150.00 | 299.22  | 57.65 | 282.55
CotSC     | 178.43 | 727.58  | 76.41 | 541.15
NESE      | 13.01  | 29.92   | 1.30  | 23.31

5.4. Computation cost

In this subsection, we first investigate the convergence speed of NESE and report the result in Fig. 1, and then compare it with the alternative methods and report the time costs of all algorithms in Table 3. Fig. 1 demonstrates that NESE generally converges within 100 iterations; in practice, the maximum number of iterations is fixed at 100. Observing Table 3, we can find that the time cost of AWP is less than that of NESE on all test datasets due to its advantage in convergence speed, although, theoretically, the computation complexity of AWP is close to that of NESE. In addition, since the convergence speed of both AMGL and AASC is generally fast, their time cost is also less than that of NESE on some datasets. Nevertheless, as reported in Table 2, NESE outperforms these approaches in terms of clustering precision. In particular, compared with SwMC, CorSC, CotSC, and MVGL, the advantage of NESE is dominant on large scale data, such as NUS and Out-Scene.


Fig. 1. The convergence speed of NESE.

Fig. 2. Four clusters provided by NESE. These clusters consist of images of seven different persons. Note that the real number of samples in Cluster 1 (Cluster 4) is larger than 10.

Fig. 3. t-SNE visualization of the ORL dataset. The clustering results are reported in Fig. 2.


5.5. Visualization

To understand the decisions of the algorithm, in this subsection we visualize the clustering result of NESE. We run NESE to partition the ORL dataset into 40 clusters. Due to space limitations, we show only 4 clusters in Fig. 2, with 10 samples per cluster. Note that these clusters involve 7 different persons, while the real number of samples in Cluster 1 (Cluster 4) is larger than 10. Besides, we use t-SNE to visualize all views of ORL and report the results in Fig. 3. Observing Fig. 3, we can find that Person 3 and Person 4 have a relatively apparent group structure in view 1 and a fuzzy group structure in the other views, but both of them are correctly grouped because multiple views are used simultaneously. Some samples of Person 4, Person 5 and Person 6 are entangled in all views and are grouped into the same class by NESE. From a human perspective, the difference between Person 1 and Person 2 is distinct, but some of their samples are close in feature space and are grouped into the same class by our method.
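A t-SNE view like Fig. 3 can be produced per view with scikit-learn; the following sketch is illustrative only (the perplexity is left at its default, and the function name is ours).

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_view(X, labels, title="view"):
    """2-D t-SNE embedding of one view, colored by the clusters found by NESE."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
    plt.figure(figsize=(4, 4))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=8, cmap="tab20")
    plt.title(title)
    plt.axis("off")
    plt.show()
```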

6. Conclusion

This work provided a novel method, namely NESE, for multi-view spectral clustering. The core idea of NESE is to learn a consistent nonnegative embedding and multiple spectral embeddings simultaneously. In particular, the nonnegative embedding directly reveals the consistent clustering result we desire. Furthermore, an inexact-MM method is developed to solve the involved objective. Numerous experimental results demonstrate the promising empirical performance of NESE. Since the subproblem solved within inexact-MM is non-convex, we have to adopt an alternating minimization method to solve it, which slows the convergence of inexact-MM. Hence, we will focus on this issue in our future work.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61772427, Grant 61751202 and Grant 61936014, in part by the National Key Research and Development Program of China under Grant 2018YFB1403500, and in part by the Fundamental Research Funds for the Central Universities under Grant G2019KY0501.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.inffus.2019.09.005.

References

[1] A. Kumar, H. Daumé, A co-training approach for multi-view spectral clustering, in: Proceedings of the International Conference on Machine Learning, 2011, pp. 393–400.
[2] A. Kumar, P. Rai, H. Daume, Co-regularized multi-view spectral clustering, in: Proceedings of the Advances in Neural Information Processing Systems, 2011, pp. 1413–1421.
[3] F. Nie, J. Li, X. Li, et al., Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2016, pp. 1881–1887.
[4] F. Nie, G. Cai, X. Li, Multi-view clustering and semi-supervised classification with adaptive neighbours, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 2408–2414.
[5] C. Zhang, H. Fu, Q. Hu, X. Cao, Y. Xie, D. Tao, D. Xu, Generalized latent multi-view subspace clustering, IEEE Trans. Pattern Anal. Mach. Intell. (2018).
[6] R. Wang, F. Nie, Z. Wang, H. Hu, X. Li, Parameter-free weighted multi-view projected clustering with structured graph learning, IEEE Trans. Knowl. Data Eng. (2019).
[7] J. Liu, C. Wang, J. Gao, J. Han, Multi-view clustering via joint nonnegative matrix factorization, in: Proceedings of the SIAM International Conference on Data Mining, SIAM, 2013, pp. 252–260.
[8] D. Greene, P. Cunningham, A matrix factorization approach for integrating multiple data views, in: Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2009, pp. 423–438.
[9] X. Cai, F. Nie, H. Huang, Multi-view k-means clustering on big data, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2013, pp. 2598–2604.
[10] H. Zhao, Z. Ding, Y. Fu, Multi-view clustering via deep matrix factorization, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 2921–2927.
[11] Z. Wang, X. Kong, H. Fu, M. Li, Y. Zhang, Feature extraction via multi-view non-negative matrix factorization with local graph regularization, in: Proceedings of the IEEE International Conference on Image Processing, IEEE, 2015, pp. 3500–3504.
[12] B. Zhao, J.T. Kwok, C. Zhang, Multiple kernel clustering, in: Proceedings of the SIAM International Conference on Data Mining, SIAM, 2009, pp. 638–649.
[13] L. Du, P. Zhou, L. Shi, H. Wang, M. Fan, W. Wang, Y.-D. Shen, Robust multiple kernel k-means using l21-norm, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2015, pp. 3476–3482.
[14] M. Gönen, A.A. Margolin, Localized data fusion for kernel k-means clustering with application to cancer biology, in: Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 1305–1313.
[15] K. Chaudhuri, S.M. Kakade, K. Livescu, K. Sridharan, Multi-view clustering via canonical correlation analysis, in: Proceedings of the Annual International Conference on Machine Learning, ACM, 2009, pp. 129–136.
[16] Y. Guo, Convex subspace representation learning from multi-view data, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2013, pp. 523–530.
[17] T. Xia, D. Tao, T. Mei, Y. Zhang, Multiview spectral embedding, IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 40 (6) (2010) 1438–1446.
[18] C. Zhang, H. Fu, S. Liu, G. Liu, X. Cao, Low-rank tensor constrained multiview subspace clustering, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1582–1590.
[19] X. Cao, C. Zhang, H. Fu, S. Liu, H. Zhang, Diversity-induced multi-view subspace clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 586–594.
[20] H. Gao, F. Nie, X. Li, H. Huang, Multi-view subspace clustering, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4238–4246.
[21] J. Xu, J. Han, F. Nie, X. Li, Re-weighted discriminatively embedded k-means for multi-view clustering, IEEE Trans. Image Process. 26 (6) (2017) 3016–3027.
[22] Z. Akata, C. Thurau, C. Bauckhage, Non-negative matrix factorization in multimodality data for segmentation and label prediction, in: Proceedings of the Computer Vision Winter Workshop, 2011.
[23] D. Kuang, C. Ding, H. Park, Symmetric nonnegative matrix factorization for graph clustering, in: Proceedings of the SIAM International Conference on Data Mining, SIAM, 2012, pp. 106–117.
[24] H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Affinity aggregation for spectral clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 773–780.
[25] F. Nie, J. Li, X. Li, Self-weighted multiview clustering with multiple graphs, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2017, pp. 2564–2570.
[26] K. Zhan, F. Nie, J. Wang, Y. Yang, Multiview consensus graph clustering, IEEE Trans. Image Process. 28 (3) (2019) 1261–1270.
[27] K. Zhan, C. Zhang, J. Guan, J. Wang, Graph learning for multiview clustering, IEEE Trans. Cybern. (99) (2017) 1–9.
[28] F. Nie, X. Wang, H. Huang, Clustering and projected clustering with adaptive neighbors, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 977–986.
[29] F. Nie, X. Wang, M.I. Jordan, H. Huang, The constrained laplacian rank algorithm for graph-based clustering, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2016, pp. 1969–1976.
[30] F. Nie, L. Tian, X. Li, Multiview clustering via adaptively weighted procrustes, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2018, pp. 2022–2030.
[31] C. Zhang, Q. Hu, H. Fu, P. Zhu, X. Cao, Latent multi-view subspace clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4279–4287.
[32] Y. Li, F. Nie, H. Huang, J. Huang, Large-scale multi-view spectral clustering via bipartite graph, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2015, pp. 2750–2756.
[33] W. Liu, J. He, S.-F. Chang, Large graph construction for scalable semi-supervised learning, in: Proceedings of the International Conference on Machine Learning, 2010, pp. 679–686.
[34] Z. Zhang, L. Liu, J. Qin, F. Zhu, F. Shen, Y. Xu, L. Shao, H. Tao Shen, Highly-economized multi-view binary compression for scalable image clustering, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 717–732.
[35] J. Wen, Z. Zhang, Y. Xu, Z. Zhong, Incomplete multi-view clustering via graph regularized matrix factorization, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 593–608.
[36] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 888–905.
[37] C. Ding, X. He, H.D. Simon, On the equivalence of nonnegative matrix factorization and spectral clustering, in: Proceedings of the SIAM International Conference on Data Mining, SIAM, 2005, pp. 606–610.
[38] K. Allab, L. Labiod, M. Nadif, Simultaneous spectral data embedding and clustering, IEEE Trans. Neural Netw. Learn. Syst. (2018).
[39] R. Wang, F. Nie, Z. Wang, F. He, X. Li, Scalable graph-based clustering with nonnegative relaxation for large hyperspectral image, IEEE Trans. Geosci. Remote Sens. (2019).


[40] R. Arora, M. Gupta, A. Kapila, M. Fazel, Clustering by left-stochastic matrix factorization, in: Proceedings of the International Conference on Machine Learning, 2011, pp. 761–768.
[41] Z. Yang, T. Hao, O. Dikmen, X. Chen, E. Oja, Clustering by nonnegative matrix factorization using graph random walk, in: Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1079–1087.
[42] J. Han, K. Xiong, F. Nie, Orthogonal and nonnegative graph reconstruction for large scale clustering, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2017, pp. 1809–1815.
[43] Y. Sun, P. Babu, D.P. Palomar, Majorization-minimization algorithms in signal processing, communications, and machine learning, IEEE Trans. Signal Process. 65 (3) (2017) 794–816.

[44] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), 1996.
[45] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, NUS-WIDE: a real-world web image database from National University of Singapore, in: Proceedings of the ACM International Conference on Image and Video Retrieval, ACM, 2009, p. 48.
[46] F.S. Samaria, A.C. Harter, Parameterisation of a stochastic model for human face identification, in: Proceedings of the Second IEEE Workshop on Applications of Computer Vision, IEEE, 1994, pp. 138–142.
[47] A. Monadjemi, B. Thomas, M. Mirmehdi, Experiments on high resolution images towards outdoor scene classification, Technical Report, University of Bristol, Department of Computer Science, 2002.
