Graph Regularized Nonnegative Matrix Factorization with Sample Diversity for Image Representation


Changpeng Wang a,*, Xueli Song a, Jiangshe Zhang b

a School of Mathematics and Information Science, Chang'an University, Xi'an 710064, China
b School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
* Corresponding author. E-mail address: [email protected] (C. Wang).


Keywords: Nonnegative matrix factorization; Semi-supervised learning; Sample diversity; Clustering

Abstract

Nonnegative Matrix Factorization (NMF) is an effective algorithm for dimensionality reduction and feature extraction in data mining and computer vision. It incorporates nonnegativity constraints into the factorization and thus obtains a parts-based representation. However, existing NMF variants cannot fully utilize the limited label information and neglect the diversity of the unlabeled samples. We therefore propose a novel NMF method, called Graph Regularized Nonnegative Matrix Factorization with Sample Diversity (GNMFSD), which makes use of label information and sample diversity to facilitate representation learning. Specifically, it first incorporates a graph regularization term that encodes the intrinsic geometrical information. Moreover, two reconstruction regularization terms based on labeled samples and virtual samples are introduced, which potentially make the new representations more discriminative and effective. An iterative updating optimization scheme is developed to solve the objective function of GNMFSD, and the convergence of the scheme is proven. Experimental results on standard image databases verify the effectiveness of the proposed method in image clustering. © 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Matrix factorization is a useful tool for data representation as well as dimensionality reduction, and various matrix factorization based algorithms have been well studied. Perhaps the best known are Principal Component Analysis (PCA) (Jolliffe, 1989), Independent Component Analysis (Hyvarinen and Oja, 2000), Linear Discriminant Analysis (LDA) (Belhumeur et al., 1997), Locally Linear Embedding (LLE) (Chen and Liu, 2011), and Nonnegative Matrix Factorization (NMF) (Lee and Seung, 1999). These methods aim to learn a compact low-dimensional representation of the original data for further applications. NMF is an unsupervised and effective analysis algorithm owing to its theoretical interpretability and desirable performance (Wang and Zhang, 2013). It aims to find a linear approximation to the original matrix through a basis matrix and a coefficient matrix, and simultaneously enforces the elements of both the basis vectors and the representation coefficients to be nonnegative. This constraint restricts NMF to additive combinations when approximating the original data, which accords with the cognitive process of the human brain (Wachsmuth et al., 1994; Logothetis and Sheinberg, 1996). Thus, NMF and its variants have been widely used in


different real-world applications, such as audiovisual document structuring (Essid and Fevotte, 2013), hyperspectral image analysis (Gillis and Plemmons, 2013), graph matching (Jiang et al., 2014), maintenance activities identification (Feng et al., 2016), face recognition (Zhi et al., 2011; Chen et al., 2016), and data clustering (Li et al., 2014; Lu and Miao, 2016). A comprehensive review of the theoretical research on NMF can be found in Wang and Zhang (2013). Although NMF has a solid mathematical foundation and encouraging performance, it still has some limitations, including neglecting the intrinsic geometric structure of the data and lacking discriminative information for clustering. Recently, several variants of NMF have been proposed to improve its performance. Graph Regularized NMF (GNMF) (Cai et al., 2011) uses a graph Laplacian as a regularization term to exploit the locality property of the data. Dual GNMF (Shang et al., 2012) and Multiple GNMF (Wang et al., 2013) have been proposed by adding further constraints to the original objective function. To use supervised information, Constrained NMF (CNMF) (Liu et al., 2011) encodes label information into NMF so that data points sharing the same label also share the same representation. However, this constraint ignores the local geometrical structure of the data set.


Furthermore, Graph Regularized and Sparse NMF with hard Constraints (GSNMFC) (Sun et al., 2016) integrates the geometrical structure, the label information, and a sparsity constraint in a joint framework. However, in GSNMFC the unlabeled data are not fully utilized because no constraints are placed on them, and the sparsity constraint causes the intra-sample structure information to be ignored during the learning process (Lu and Miao, 2016). Therefore, it is necessary to establish a semi-supervised framework for NMF that can fully utilize the label information and the diversity of the unlabeled data simultaneously.

In this paper, we propose a novel NMF method, called Graph Regularized Nonnegative Matrix Factorization with Sample Diversity (GNMFSD), which adds three constraint terms to ensure the effectiveness of the obtained representations. Our algorithm is motivated by recent studies on dictionary learning, in particular the sample diversity learning proposed by Xu et al. (2017). In our method, we incorporate the label information into a graph that encodes the intrinsic geometrical structure of the data space, and a linear regression term based on the labeled samples to promote the discriminant power of the learned basis vectors. Furthermore, a sample diversity term is used to obtain more effective representations of the unlabeled data. By combining these three constraints with the graph-based regularizer, we expect to fully utilize both the unlabeled data and the limited label information to improve clustering performance. In addition, we discuss how to solve the corresponding optimization problem and theoretically prove that the objective function is nonincreasing under the corresponding update rules. Finally, we conduct extensive experiments to validate the effectiveness of GNMFSD. The main contributions of this paper are summarized as follows:

1. The proposed GNMFSD method not only takes the diversity of the unlabeled samples into account, but also characterizes both the underlying local geometrical information and the global discriminative information of the samples through additional regularization terms. Hence, the learned representations have more discriminating power, which improves clustering accuracy and normalized mutual information.
2. The updating rules that solve the objective function and the corresponding convergence proof are provided. Experiments on real databases demonstrate the effectiveness of the algorithm quantitatively.

The rest of this paper is organized as follows. Section 2 reviews the related work, including NMF, GNMF, and CNMF. Section 3 introduces the proposed algorithm, together with its optimization scheme and convergence analysis. Experimental results are presented in Section 4, and Section 5 concludes the paper.

2. Related work

In this section, we briefly review the algorithms most relevant to our work: NMF (Lee and Seung, 1999), Graph Regularized NMF (GNMF) (Cai et al., 2011), and Constrained NMF (CNMF) (Liu et al., 2011).

2.1. NMF

Nonnegative matrix factorization (NMF) is a linear model that learns a parts-based representation of the data. It aims to find two nonnegative matrices U ∈ R^{m×k} and V ∈ R^{n×k} that approximate the original data matrix X = [x_1, …, x_n] ∈ R^{m×n}:

X ≈ U V^T.   (1)

In this representation, U can be regarded as a set of basis vectors and V as the representation of each sample with respect to these basis vectors. To measure the quality of the decomposition, the Euclidean-distance-based objective function is expressed as:

min_{U,V} ||X − U V^T||_F^2,   s.t. U ≥ 0, V ≥ 0,   (2)

where ||·||_F denotes the matrix Frobenius norm. Because the objective function is not convex in U and V jointly, it is difficult to obtain its global minimum. Lee and Seung (1999) provided update rules that reach a local minimum and proved their convergence:

u_{ik} ← u_{ik} (X V)_{ik} / (U V^T V)_{ik},   v_{jk} ← v_{jk} (X^T U)_{jk} / (V U^T U)_{jk}.   (3)

The iterative update procedure is executed repeatedly to decrease the approximation error, and the final U and V are obtained when a given termination condition is met. With NMF, each data point x_i is approximated by a linear combination of the columns of U, weighted by the i-th row of V. NMF has achieved good results in many practical applications owing to its effectiveness and ease of implementation.

2.2. GNMF

As mentioned above, NMF learns a parts-based representation in Euclidean space but neglects the intrinsic geometric structure of the original data. In order to preserve this structure during the matrix decomposition, Cai et al. (2011) proposed the Graph Regularized Nonnegative Matrix Factorization (GNMF) algorithm. In GNMF, a nearest-neighbor graph is constructed to preserve the geometrical structure of the data space, and the objective function is formulated as:

min_{U,V} ||X − U V^T||_F^2 + λ Tr(V^T L V),   s.t. U ≥ 0, V ≥ 0,   (4)

where L = D − W is the graph Laplacian matrix, W is the weight matrix measuring the similarity between nearby data points, and D is a diagonal matrix whose entries are the column (or row) sums of W, D_{jj} = Σ_l W_{jl}. The regularization parameter λ balances the reconstruction error and the manifold term. The update rules that solve (4) are:

u_{ik} ← u_{ik} (X V)_{ik} / (U V^T V)_{ik},   v_{jk} ← v_{jk} (X^T U + λ W V)_{jk} / (V U^T U + λ D V)_{jk}.   (5)

With the local manifold learning in GNMF, nearby data points are encouraged to remain as close as possible in the new data space.
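The multiplicative updates in (3) and (5) are straightforward to implement. The sketch below is a rough NumPy illustration rather than the authors' code; the function name, the random initialization, and the small epsilon added to the denominators for numerical stability are our own assumptions.

```python
import numpy as np

def gnmf(X, k, W=None, lam=0.0, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates for NMF (eq. 3) and GNMF (eq. 5).

    X : (m, n) nonnegative data matrix, one sample per column.
    W : optional (n, n) symmetric nonnegative graph weight matrix;
        W = None or lam = 0 reduces the routine to plain NMF.
    Returns U (m, k) and V (n, k) with X ~ U @ V.T.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))
    V = rng.random((n, k))
    if W is None:
        W = np.zeros((n, n))
    D = np.diag(W.sum(axis=1))          # degree matrix, D_jj = sum_l W_jl
    for _ in range(n_iter):
        # eq. (3)/(5): update of the basis matrix U
        U *= (X @ V) / (U @ V.T @ V + eps)
        # eq. (5): update of the coefficient matrix V (eq. (3) when lam = 0)
        V *= (X.T @ U + lam * (W @ V)) / (V @ U.T @ U + lam * (D @ V) + eps)
    return U, V
```

Passing no weight matrix recovers the original NMF updates, while supplying a nearest-neighbor weight matrix W with lam > 0 follows the GNMF rules.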


2.3. CNMF

In order to take advantage of partially labeled data, CNMF (Liu et al., 2011) introduces a label constraint matrix A that incorporates the label information of the data. Suppose there are c classes, the first l data points x_1, …, x_l are labeled, and the remaining n − l data points x_{l+1}, …, x_n are unlabeled. The l × c indicator matrix C is defined as:

c_{ij} = 1 if x_i is labeled with the j-th class, and c_{ij} = 0 otherwise.   (6)

With the indicator matrix C, the label constraint matrix A is defined as:

A = [ C, 0; 0, I_{n−l} ],   (7)

where I_{n−l} is an (n − l) × (n − l) identity matrix. With these matrices, the original data X are approximated as X ≈ U V^T = U (A Z)^T. The indicator matrix C implies that if samples i and j share the same label, their weighted coefficient vectors are also the same. The CNMF algorithm with the label constraints minimizes the following objective function:

min_{U,Z} ||X − U Z^T A^T||_F^2,   s.t. U ≥ 0, Z ≥ 0.   (8)

The iterative multiplicative updating scheme that optimizes (8) is:

u_{ik} ← u_{ik} (X A Z)_{ik} / (U Z^T A^T A Z)_{ik},   z_{ij} ← z_{ij} (A^T X^T U)_{ij} / (A^T A Z U^T U)_{ij}.   (9)

In the following section, we describe how to construct a novel semi-supervised matrix decomposition and formally introduce the proposed algorithm.

3. Graph regularized nonnegative matrix factorization with sample diversity

In this section, we present the details of the proposed Graph Regularized Nonnegative Matrix Factorization with Sample Diversity (GNMFSD) algorithm, which integrates label information, unlabeled sample diversity, and discriminative basis vector learning into the nonnegative matrix factorization procedure. We then derive an optimization scheme based on iterative updating rules to solve the objective function. Finally, the convergence analysis of the algorithm is presented.

3.1. Objective function

Given a nonnegative training data matrix X = [x_1, …, x_n] ∈ R^{m×n}, where m is the number of features and n is the number of samples, X consists of the labeled samples X_l = [x_1, …, x_r] ∈ R^{m×r} and the unlabeled samples X_u = [x_{r+1}, …, x_n] ∈ R^{m×(n−r)}. NMF tries to learn a basis matrix U and a coefficient matrix V whose product approximates X as closely as possible. With the label constraint matrix A and the auxiliary matrix Z (Liu et al., 2011), the original data X are approximated as X ≈ U V^T = U (A Z)^T.

In this work, we aim to obtain a data representation that preserves both the local geometrical structure and the global discriminating information. Therefore, a graph manifold regularization is employed to reflect the intrinsic geometry of the data distribution, and a linear regression term based on the labeled samples is used to promote global discriminant information learning. Using the label information, these two terms can be written as:

min_Z Tr(Z^T A^T L A Z)   (10)

min_{U,Z} ||X_l − U Z^T A_l^T||_F^2   (11)

where L = D − W denotes the graph Laplacian matrix and A_l denotes the first r rows of A, i.e., the rows corresponding to the labeled samples. Meanwhile, we minimize the reconstruction loss with the prior label information in order to guarantee the reconstructive power of the model:

min_{U,Z} ||X − U Z^T A^T||_F^2.   (12)

In addition, we utilize the diversity of the unlabeled samples to obtain more effective representations. The virtual alternative samples X_a ∈ R^{m×(n−r)} are derived by adding Gaussian noise to the original unlabeled samples X_u. Adding noise of small amplitude to the samples is equivalent to a Tikhonov regularization, which is useful for obtaining a robust framework (van Someren et al., 2001). The virtual alternative training samples are expected to be well represented under the same basis matrix U:

min_{U,Z} ||X_a − U Z^T A_a^T||_F^2   (13)

where A_a denotes the last n − r rows of A.

We incorporate the above four terms and establish the objective function of GNMFSD as follows:

min_{U,Z} ||X − U Z^T A^T||_F^2 + λ||X_l − U Z^T A_l^T||_F^2 + β||X_a − U Z^T A_a^T||_F^2 + γ Tr(Z^T A^T L A Z),   s.t. U ≥ 0, Z ≥ 0,   (14)

where the constants λ, β, and γ are tradeoff parameters that balance the contributions of the regularization terms in the objective function.

3.2. Update rules

The objective function in (14) is not convex in U and Z jointly, so it is hard to find the global minimum. In the following, we introduce an iterative algorithm that obtains a local minimum. Applying the matrix properties Tr(AB) = Tr(BA) and Tr(A) = Tr(A^T), the objective function in (14) can be rewritten as:

OF = ||X − U Z^T A^T||_F^2 + λ||X_l − U Z^T A_l^T||_F^2 + β||X_a − U Z^T A_a^T||_F^2 + γ Tr(Z^T A^T L A Z)
   = Tr((X − U Z^T A^T)(X − U Z^T A^T)^T) + λ Tr((X_l − U Z^T A_l^T)(X_l − U Z^T A_l^T)^T) + β Tr((X_a − U Z^T A_a^T)(X_a − U Z^T A_a^T)^T) + γ Tr(Z^T A^T L A Z)
   = Tr(X X^T) − 2 Tr(X A Z U^T) + Tr(U Z^T A^T A Z U^T) + λ Tr(X_l X_l^T) − 2λ Tr(X_l A_l Z U^T) + λ Tr(U Z^T A_l^T A_l Z U^T) + β Tr(X_a X_a^T) − 2β Tr(X_a A_a Z U^T) + β Tr(U Z^T A_a^T A_a Z U^T) + γ Tr(Z^T A^T L A Z),

where Tr(·) denotes the trace of a matrix. For the two constraints U ≥ 0 and Z ≥ 0, we introduce two Lagrange multiplier matrices Ω = [ω_{ij}] and Φ = [φ_{ij}], respectively. The Lagrangian can then be written as:

ℒ = OF + Tr(Ω U^T) + Tr(Φ Z^T).   (15)

The partial derivatives of (15) with respect to U and Z are:

∂ℒ/∂U = −2 X A Z − 2λ X_l A_l Z − 2β X_a A_a Z + 2 U Z^T A^T A Z + 2λ U Z^T A_l^T A_l Z + 2β U Z^T A_a^T A_a Z + Ω   (16)

∂ℒ/∂Z = −2 A^T X^T U − 2λ A_l^T X_l^T U − 2β A_a^T X_a^T U + 2 A^T A Z U^T U + 2λ A_l^T A_l Z U^T U + 2β A_a^T A_a Z U^T U + 2γ A^T L A Z + Φ.   (17)

Setting these derivatives to zero and using the Karush–Kuhn–Tucker (KKT) conditions ω_{ij} u_{ij} = 0 and φ_{ij} z_{ij} = 0, we obtain the following equations for u_{ij} and z_{ij}:

−(X A Z)_{ij} u_{ij} − λ(X_l A_l Z)_{ij} u_{ij} − β(X_a A_a Z)_{ij} u_{ij} + (U Z^T A^T A Z)_{ij} u_{ij} + λ(U Z^T A_l^T A_l Z)_{ij} u_{ij} + β(U Z^T A_a^T A_a Z)_{ij} u_{ij} = 0   (18)

−(A^T X^T U)_{ij} z_{ij} − λ(A_l^T X_l^T U)_{ij} z_{ij} − β(A_a^T X_a^T U)_{ij} z_{ij} + (A^T A Z U^T U)_{ij} z_{ij} + λ(A_l^T A_l Z U^T U)_{ij} z_{ij} + β(A_a^T A_a Z U^T U)_{ij} z_{ij} + γ(A^T L A Z)_{ij} z_{ij} = 0.   (19)

These equations lead to the following two updating rules:

u_{ik} ← u_{ik} (X A Z + λ X_l A_l Z + β X_a A_a Z)_{ik} / (U Z^T A^T A Z + λ U Z^T A_l^T A_l Z + β U Z^T A_a^T A_a Z)_{ik}   (20)

z_{kj} ← z_{kj} (A^T X^T U + λ A_l^T X_l^T U + β A_a^T X_a^T U + γ A^T W A Z)_{kj} / (A^T A Z U^T U + λ A_l^T A_l Z U^T U + β A_a^T A_a Z U^T U + γ A^T D A Z)_{kj}.   (21)

We have the following theorem regarding the above iterative updating rules, which guarantees that the objective function converges to a local minimum.

Theorem 1. The objective function in (14) is nonincreasing under the update rules (20) and (21). The objective function is invariant under these updates if and only if U and Z are at a stationary point.
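Before turning to the convergence analysis, the following sketch shows one way the pieces of this section could be assembled in NumPy: the label constraint matrix of eqs. (6)–(7), the noise-perturbed virtual samples X_a, and the multiplicative updates (20)–(21). The helper names, the clipping of the virtual samples at zero, and the stabilizing epsilon are illustrative assumptions, not part of the paper.

```python
import numpy as np

def label_constraint_matrix(y_labeled, n):
    """Build A = [[C, 0], [0, I]] from eqs. (6)-(7); y_labeled holds the
    class indices of the first r samples (r = len(y_labeled))."""
    r = len(y_labeled)
    classes = sorted(set(y_labeled))
    C = np.zeros((r, len(classes)))
    for i, y in enumerate(y_labeled):
        C[i, classes.index(y)] = 1.0
    A = np.zeros((n, len(classes) + n - r))
    A[:r, :len(classes)] = C
    A[r:, len(classes):] = np.eye(n - r)
    return A

def gnmfsd(X, y_labeled, k, W, lam=100.0, beta=1.0, gamma=100.0,
           noise_std=0.1, n_iter=200, eps=1e-10, seed=0):
    """Sketch of the GNMFSD updates (20)-(21). X is (m, n) with the r labeled
    samples in the first columns; W is an (n, n) graph weight matrix."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    r = len(y_labeled)
    A = label_constraint_matrix(y_labeled, n)
    Al, Aa = A[:r, :], A[r:, :]            # rows for labeled / unlabeled samples
    Xl = X[:, :r]
    # virtual samples: Gaussian noise on the unlabeled block, clipped at zero
    # to keep them nonnegative (an implementation choice, not stated in the paper)
    Xa = np.clip(X[:, r:] + rng.normal(0.0, noise_std, (m, n - r)), 0, None)
    D = np.diag(W.sum(axis=1))
    U = rng.random((m, k))
    Z = rng.random((A.shape[1], k))
    for _ in range(n_iter):
        AZ, AlZ, AaZ = A @ Z, Al @ Z, Aa @ Z
        # eq. (20): update of U
        num_U = X @ AZ + lam * (Xl @ AlZ) + beta * (Xa @ AaZ)
        den_U = U @ (AZ.T @ AZ) + lam * U @ (AlZ.T @ AlZ) + beta * U @ (AaZ.T @ AaZ)
        U *= num_U / (den_U + eps)
        # eq. (21): update of Z
        UtU = U.T @ U
        num_Z = A.T @ X.T @ U + lam * Al.T @ Xl.T @ U + beta * Aa.T @ Xa.T @ U + gamma * A.T @ W @ AZ
        den_Z = A.T @ A @ Z @ UtU + lam * Al.T @ Al @ Z @ UtU + beta * Aa.T @ Aa @ Z @ UtU + gamma * A.T @ D @ AZ
        Z *= num_Z / (den_Z + eps)
    V = A @ Z                              # new representation, one row per sample
    return U, Z, V
```

The returned V = A Z plays the role of the coefficient matrix on which clustering is later performed.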


3.3. Convergence analysis

To prove Theorem 1, we first introduce an auxiliary function, as used in the Expectation–Maximization algorithm (Dempster et al., 1977).

Lemma 1. If there exists an auxiliary function G for F(x) with the properties G(x, x') ≥ F(x) and G(x, x) = F(x), then F is nonincreasing under the update:

x^{t+1} = arg min_x G(x, x^t).   (22)

From Lemma 1, the equality F(x^{t+1}) = F(x^t) holds only if x^t is a local minimum of G(x, x^t). By iteratively applying the update rule (22), we obtain a local minimum x_min of the objective function through the convergent sequence of estimates:

F(x_min) ≤ ⋯ ≤ F(x^{t+1}) ≤ F(x^t) ≤ ⋯ ≤ F(x^1) ≤ F(x^0).   (23)

Lemma 2. Let F' denote the first-order derivative with respect to Z. Then

G(z, z^t_{ij}) = F(z^t_{ij}) + F'(z^t_{ij})(z − z^t_{ij}) + [(A^T A Z U^T U + λ A_l^T A_l Z U^T U + β A_a^T A_a Z U^T U + γ A^T D A Z)_{ij} / z^t_{ij}] (z − z^t_{ij})^2   (24)

is an auxiliary function with respect to z_{ij} of the objective function in (14).

Proof. Obviously, G(z, z) = F_{ij}(z). For the condition G(z, z^t_{ij}) ≥ F_{ij}(z), we compare the auxiliary function G(z, z^t_{ij}) with the Taylor series expansion of F_{ij}(z):

F_{ij}(z) = F_{ij}(z^t_{ij}) + F'_{ij}(z − z^t_{ij}) + (1/2) F''_{ij}(z − z^t_{ij})^2,   (25)

where F''_{ij} is the second-order partial derivative of the objective function with respect to Z. It is easy to check that

F'_{ij} = (∂OF/∂Z)_{ij} = (−2 A^T X^T U − 2λ A_l^T X_l^T U − 2β A_a^T X_a^T U + 2 A^T A Z U^T U + 2λ A_l^T A_l Z U^T U + 2β A_a^T A_a Z U^T U + 2γ A^T L A Z)_{ij}   (26)

F''_{ij} = 2(A^T A)_{ii}(U^T U)_{jj} + 2λ(A_l^T A_l)_{ii}(U^T U)_{jj} + 2β(A_a^T A_a)_{ii}(U^T U)_{jj} + 2γ(A^T L A)_{ii}.   (27)

Putting (27) into (25) and comparing the second-order terms of the auxiliary function (24) with those of the Taylor series expansion, we see that, instead of verifying G(z, z^t_{ij}) ≥ F_{ij}(z) directly, it is equivalent to prove

(A^T A Z U^T U + λ A_l^T A_l Z U^T U + β A_a^T A_a Z U^T U + γ A^T D A Z)_{ij} / z^t_{ij} ≥ (A^T A)_{ii}(U^T U)_{jj} + λ(A_l^T A_l)_{ii}(U^T U)_{jj} + β(A_a^T A_a)_{ii}(U^T U)_{jj} + γ(A^T L A)_{ii}.   (28)

It therefore suffices to prove the following four inequalities:

(A^T A Z U^T U)_{ij} / z^t_{ij} ≥ (A^T A)_{ii}(U^T U)_{jj}   (29)

(A_a^T A_a Z U^T U)_{ij} / z^t_{ij} ≥ (A_a^T A_a)_{ii}(U^T U)_{jj}   (30)

(A_l^T A_l Z U^T U)_{ij} / z^t_{ij} ≥ (A_l^T A_l)_{ii}(U^T U)_{jj}   (31)

(A^T D A Z)_{ij} / z^t_{ij} ≥ (A^T L A)_{ii}.   (32)

For the first inequality, we have

(A^T A Z U^T U)_{ij} = Σ_{q=1}^{k} (A^T A Z)_{iq}(U^T U)_{qj} ≥ (A^T A Z)_{ij}(U^T U)_{jj} = Σ_q (A^T A)_{iq} z^t_{qj} (U^T U)_{jj} ≥ z^t_{ij} (A^T A)_{ii}(U^T U)_{jj},   (33)

and for the last one,

(A^T D A Z)_{ij} = Σ_{p=1}^{M} (A^T D A)_{ip} z^t_{pj} ≥ (A^T D A)_{ii} z^t_{ij} ≥ (A^T (D − W) A)_{ii} z^t_{ij} = (A^T L A)_{ii} z^t_{ij}.   (34)

Similar to (33), we can easily obtain (A_a^T A_a Z U^T U)_{ij} ≥ z^t_{ij}(A_a^T A_a)_{ii}(U^T U)_{jj} and (A_l^T A_l Z U^T U)_{ij} ≥ z^t_{ij}(A_l^T A_l)_{ii}(U^T U)_{jj}.

Thus, (28) holds and G(z, z^t_{ij}) ≥ F_{ij}(z). In the same way, we can also prove that the objective function is nonincreasing under the updating rule for U. □

Therefore, Theorem 1 guarantees that the objective function is nonincreasing under the update rules (20) and (21).

4. Experimental results

To evaluate the performance of the proposed method, we conduct experiments on image clustering and compare the results with several representative methods, including K-means, NMF, GNMF, CNMF, and GSNMFC.

4.1. Evaluation metrics

The clustering performance is evaluated by comparing the label obtained for each sample with the label provided by the data set. Two metrics are used (Cai et al., 2005): accuracy (AC) and normalized mutual information (nMI). Given a data point x_i, let r_i and s_i be the cluster label obtained by an algorithm and the label provided by the data set, respectively. AC is defined as:

AC = Σ_{i=1}^{N} δ(r_i, map(s_i)) / N   (35)

where N is the total number of samples, δ(x, y) equals 1 if x = y and 0 otherwise, and map(s_i) is the mapping function that maps each cluster label to the corresponding label in the data set. The best mapping is determined by the Kuhn–Munkres algorithm.

The second metric is the normalized mutual information (nMI), which measures the similarity of two clusterings. Given two sets of image clusters C = {c_1, …, c_k} and C' = {c'_1, …, c'_k}, their mutual information MI(C, C') is defined as:

MI(C, C') = Σ_{c_i ∈ C, c'_j ∈ C'} p(c_i, c'_j) · log [ p(c_i, c'_j) / (p(c_i) · p(c'_j)) ]   (36)

where p(c_i) and p(c'_j) denote the probabilities that an image arbitrarily selected from the data set belongs to cluster c_i and c'_j, respectively, and p(c_i, c'_j) denotes the joint probability that this arbitrarily selected image belongs to both clusters simultaneously. MI(C, C') takes values between zero and max(H(C), H(C')), where H(C) and H(C') denote the entropies of C and C'. It reaches its maximum when the two sets of clusters are identical and equals zero when they are completely independent. An advantage of MI(C, C') is that its value is invariant under permutations of the cluster labels. Dividing the mutual information by max(H(C), H(C')) gives the normalized mutual information:

nMI(C, C') = MI(C, C') / max(H(C), H(C')).   (37)
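As a concrete reference for eqs. (35)–(37), the sketch below computes AC with the Kuhn–Munkres (Hungarian) matching provided by SciPy and nMI from the empirical joint distribution. It assumes integer label arrays and is an illustrative implementation, not the evaluation code used by the authors.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Kuhn-Munkres algorithm

def clustering_accuracy(true_labels, cluster_labels):
    """AC of eq. (35): best one-to-one mapping between clusters and classes."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # confusion[i, j] = number of samples in cluster i with true class j
    confusion = np.array([[np.sum((cluster_labels == ci) & (true_labels == cj))
                           for cj in classes] for ci in clusters])
    row, col = linear_sum_assignment(-confusion)   # maximize matched samples
    return confusion[row, col].sum() / len(true_labels)

def normalized_mutual_information(true_labels, cluster_labels):
    """nMI of eqs. (36)-(37), computed from the empirical joint distribution."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(true_labels)
    classes, clusters = np.unique(true_labels), np.unique(cluster_labels)
    mi = 0.0
    for ci in clusters:
        p_i = np.mean(cluster_labels == ci)
        for cj in classes:
            p_ij = np.sum((cluster_labels == ci) & (true_labels == cj)) / n
            p_j = np.mean(true_labels == cj)
            if p_ij > 0:
                mi += p_ij * np.log(p_ij / (p_i * p_j))
    entropy = lambda p: -np.sum(p * np.log(p))
    h_true = entropy(np.array([np.mean(true_labels == cj) for cj in classes]))
    h_clu = entropy(np.array([np.mean(cluster_labels == ci) for ci in clusters]))
    return mi / max(h_true, h_clu)
```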


4.2. Clustering results

The clustering performance is evaluated on three publicly available image data sets: PIE (http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html), COIL20 (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php), and MNIST (http://yann.lecun.com/exdb/mnist). The main statistics of these data sets are summarized in Table 1. Two metrics, the accuracy (AC) and the normalized mutual information (nMI), are used to evaluate the results.

Table 1. Statistics of the data sets.

Dataset    Size (n)    Dimensionality (m)    # of classes
PIE        2856        1024                  68
COIL20     1440        1024                  20
MNIST      2000        784                   10

In order to evaluate the algorithms over different sample sizes, different cluster numbers are used in the experiments. We choose {4, 6, 8, 10, 20, 30, 40, 50, 60} clusters from the PIE database, {4, 6, 8, 10, 12, 14, 16, 18, 20} clusters from the COIL20 database, and {2, 3, 4, 5, 6, 7, 8, 9, 10} clusters from the MNIST database. For a fixed cluster number k, we randomly choose k categories from the database and mix their images as the data set X for clustering. For fairness, the parameter values of each method remain the same over all evaluation runs and are set as suggested in the corresponding literature. For the weight matrix W of the neighborhood graph, the 0-1 weighting is used in GNMF, while the heat-kernel weighting is used in GSNMFC and in the proposed GNMFSD. For all graph-based methods, the number of nearest neighbors K is set to 5, as suggested in Cai et al. (2011). Striking a balance between AC and nMI, we set λ = 100, β = 1, and γ = 100 in all experiments, and 10 percent of the images from each category are randomly chosen as the labeled samples. We set the dimensionality of the new space equal to the number of clusters and apply K-means to the new representation. For each cluster number k, the experiment is repeated for 20 runs on randomly chosen clusters, and the average clustering performance is recorded as the final result. In GNMFSD, the virtual alternative samples X_a are derived by adding Gaussian noise with zero mean and standard deviation 0.1 to the original unlabeled samples. The accuracy and normalized mutual information are calculated from the predicted and the given labels.
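To make the evaluation protocol above concrete (learn the new representation, run K-means on it, then score with AC and nMI), a single run might look like the sketch below. It reuses the hypothetical gnmfsd, clustering_accuracy, and normalized_mutual_information helpers sketched earlier and a heat-kernel nearest-neighbor graph; all function names and the labeled-sample selection details are our own illustrative choices, not the authors' experimental code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def heat_kernel_graph(X, n_neighbors=5, sigma=1.0):
    """Symmetric K-nearest-neighbor graph with heat-kernel weights (one common choice)."""
    conn = kneighbors_graph(X.T, n_neighbors, mode='distance', include_self=False).toarray()
    W = np.where(conn > 0, np.exp(-conn ** 2 / (2 * sigma ** 2)), 0.0)
    return np.maximum(W, W.T)

def run_once(X, y, labeled_ratio=0.1, lam=100.0, beta=1.0, gamma=100.0, seed=0):
    """One evaluation run: 10% labeled samples, k = number of clusters, K-means on A @ Z."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    k = len(np.unique(y))
    # move a random 10% of each class to the front, as the labeled block
    labeled = np.concatenate([rng.choice(np.where(y == c)[0],
                                         max(1, int(labeled_ratio * np.sum(y == c))),
                                         replace=False) for c in np.unique(y)])
    order = np.concatenate([labeled, np.setdiff1d(np.arange(n), labeled)])
    X, y = X[:, order], y[order]
    W = heat_kernel_graph(X)
    _, _, V = gnmfsd(X, list(y[:len(labeled)]), k, W, lam, beta, gamma, seed=seed)
    pred = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(V)
    return clustering_accuracy(y, pred), normalized_mutual_information(y, pred)
```

Averaging the two returned scores over 20 such runs with different random seeds mirrors the averaging procedure described above.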

4.2.1. PIE database

The CMU PIE database contains 32 × 32 gray-scale face images of 68 people; each person has 42 facial images taken under different poses, lighting, and illumination conditions. The images are preprocessed in advance so that the faces are located. Sample images from the PIE database are depicted in Fig. 1.

Tables 2 and 3 report the clustering accuracy and normalized mutual information of the six algorithms, respectively. Our proposed algorithm consistently outperforms all the other algorithms. The average clustering accuracies obtained by K-means, NMF, GNMF, CNMF, GSNMFC, and GNMFSD are 29.75%, 57.27%, 81.94%, 61.27%, 83.48%, and 86.05%, respectively. Compared with the best of the other five algorithms, our approach achieves a 2.5% improvement in accuracy. For normalized mutual information, our approach achieves a 1.2% improvement over GSNMFC. This demonstrates that the proposed approach obtains more effective representations for data clustering and thus improves the performance.

Table 2. Clustering performance on the PIE database: AC (%).

k      K-means   NMF      GNMF     CNMF     GSNMFC   GNMFSD
4      41.52     57.77    90.48    61.61    94.91    95.33
6      34.82     57.94    87.50    61.71    94.92    97.02
8      32.65     55.60    85.82    61.26    88.01    91.04
10     29.76     56.69    83.49    59.21    85.70    90.95
20     27.61     57.08    82.10    61.35    81.10    83.01
30     26.20     57.62    79.03    60.94    78.56    82.37
40     25.59     57.57    77.95    60.79    76.14    78.93
50     25.25     57.79    75.91    62.80    76.50    77.81
60     24.35     57.34    75.19    61.73    75.52    78.00
Avg.   29.75     57.27    81.94    61.27    83.48    86.05

Table 3. Clustering performance on the PIE database: nMI (%).

k      K-means   NMF      GNMF     CNMF     GSNMFC   GNMFSD
4      23.38     45.86    89.09    50.02    94.32    94.94
6      30.17     57.92    88.10    60.72    93.79    95.64
8      33.95     61.70    88.68    65.07    91.38    92.81
10     34.33     66.09    88.20    66.63    90.74    92.75
20     44.34     73.57    89.43    76.27    89.02    90.14
30     48.06     77.62    89.33    78.93    89.34    90.97
40     50.19     79.19    89.06    80.36    88.64    89.84
50     52.27     80.31    88.67    81.93    89.09    89.87
60     53.00     81.03    88.29    82.34    89.13    90.07
Avg.   41.08     69.25    88.76    71.36    90.61    91.89

4.2.2. COIL20 database

The COIL20 image database contains 32 × 32 gray-scale images of 20 objects viewed from varying angles; see Fig. 2 for some sample images. Each object has 72 images in our experiments. According to Tables 4 and 5, our approach GNMFSD achieves the best performance in most cases. The average clustering accuracies obtained by K-means, NMF, GNMF, CNMF, GSNMFC, and GNMFSD are 69.33%, 69.37%, 83.95%, 72.15%, 85.59%, and 86.43%, respectively. On this database, GNMFSD and GSNMFC obtain similar performance, but GNMFSD still narrowly beats GSNMFC in both accuracy and normalized mutual information. In addition, GNMF, GNMFSD, and GSNMFC significantly outperform the other three algorithms in terms of both accuracy and normalized mutual information, because these three algorithms include a graph manifold regularization that preserves the local geometrical structure.

Table 4. Clustering performance on the COIL20 database: AC (%).

k      K-means   NMF      GNMF     CNMF     GSNMFC   GNMFSD
4      82.47     79.36    89.81    80.85    96.82    98.04
6      72.03     68.58    87.42    73.63    89.90    90.90
8      75.14     75.21    87.76    77.16    88.19    87.80
10     69.76     71.88    87.56    76.07    86.98    87.35
12     68.65     70.83    84.73    72.82    84.69    86.15
14     66.57     66.94    80.99    68.72    83.58    83.00
16     65.01     64.28    80.63    67.21    80.83    82.86
18     61.61     64.21    78.82    67.48    79.74    81.71
20     62.70     63.02    77.82    65.42    79.66    80.09
Avg.   69.33     69.37    83.95    72.15    85.59    86.43

4.2.3. MNIST database

The third database is the MNIST handwritten digit database, which contains 70,000 images of the digits 0 through 9. The images of each class (digit) are of size 28 × 28, so each digit image is represented by a 784-dimensional vector. We use a test set of 2000 examples for clustering. Some sample handwritten digit images from this database are shown in Fig. 3. Similar to He et al. (2011), we randomly select 200 digit images from each class in this experiment.


Fig. 1. Sample face images from the PIE database.

Fig. 2. Sample images from the COIL20 database.

Fig. 3. Sample images from the MNIST database.

Table 5. Clustering performance on the COIL20 database: nMI (%).

k      K-means   NMF      GNMF     CNMF     GSNMFC   GNMFSD
4      72.33     67.57    86.74    69.22    94.88    95.78
6      70.30     67.00    88.53    70.89    91.32    92.35
8      77.22     75.96    90.59    77.64    92.51    92.63
10     74.43     74.69    90.90    77.69    92.19    92.30
12     73.98     74.04    89.70    75.31    90.67    91.67
14     73.42     72.93    88.75    74.02    91.09    90.91
16     73.55     71.91    88.42    74.19    89.77    90.50
18     73.34     72.87    88.64    74.80    89.58    90.53
20     74.02     73.21    88.95    74.55    89.51    90.09
Avg.   73.62     72.24    89.02    74.26    91.28    91.86

Table 6. Clustering performance on the MNIST database: AC (%).

k      K-means   NMF      GNMF     CNMF     GSNMFC   GNMFSD
2      92.26     90.41    96.11    91.95    95.29    96.94
3      74.64     73.13    78.22    76.11    82.69    86.93
4      66.32     63.93    74.79    68.48    81.41    83.92
5      64.34     59.90    65.18    65.10    77.64    82.74
6      61.90     58.99    63.32    62.43    71.51    80.04
7      58.61     55.00    62.25    60.39    68.76    74.37
8      57.03     52.77    55.28    56.89    65.11    70.50
9      54.97     50.99    57.12    56.87    58.82    69.49
10     52.72     48.61    49.74    54.64    53.25    65.58
Avg.   64.75     61.53    66.89    65.87    72.72    78.95

Table 7. Clustering performance on the MNIST database: nMI (%).

k      K-means   NMF      GNMF     CNMF     GSNMFC   GNMFSD
2      68.90     62.04    82.89    66.73    83.29    86.10
3      53.86     48.64    68.48    52.19    71.30    75.99
4      51.94     44.49    65.48    49.08    72.11    74.00
5      52.36     45.12    63.78    49.15    69.51    74.03
6      55.71     50.44    64.85    53.41    66.53    74.22
7      53.88     48.43    64.66    52.82    65.28    71.21
8      52.81     46.53    60.18    50.78    61.99    68.83
9      52.40     46.74    61.53    50.85    58.19    68.22
10     51.77     45.73    57.99    49.97    55.12    65.81
Avg.   54.85     48.68    65.54    52.77    67.04    73.16

The accuracy and normalized mutual information on MNIST are reported in Tables 6 and 7. As we can see, the proposed GNMFSD achieves the best performance for all values of k. Compared with the best algorithm other than GNMFSD, i.e., GSNMFC, our algorithm achieves a 6% improvement in average accuracy and a 6% improvement in average normalized mutual information. These results suggest that our approach can efficiently utilize the label information, the unlabeled sample diversity, and the local geometrical structure to improve the clustering performance.


Fig. 4. The performance of GNMFSD versus the labeling percentage on three databases. (a) AC. (b) nMI.

Fig. 5. Basis vectors learned from the PIE database. (a) NMF. (b) GNMFSD.

4.3. Discussion

Fig. 4 shows the average accuracy and normalized mutual information on the three databases versus the labeling percentage. The cluster number is set to 10 on PIE and COIL20 and to 4 on MNIST. Twenty test runs are conducted with the label percentage varying from 10% to 80%, and the average clustering performance is recorded as the final result. We find that the clustering results generally improve as the number of labeled samples increases. Thus, label information can enhance the clustering performance when it is properly embedded.

The basis vectors of the PIE database learned by NMF and GNMFSD are visualized in Fig. 5 as 32 × 32 gray-scale images. It is apparent that the basis vectors learned by GNMFSD are much sparser than those learned by NMF. Interestingly, face parts such as eyes, cheeks, and chins can now be observed in Fig. 5(b), which suggests that GNMFSD learns a better parts-based representation than NMF. Therefore, we conclude that an effective data representation is important for data clustering.

5. Conclusion and future work

In this paper, we have proposed a novel approach called Graph Regularized Nonnegative Matrix Factorization with Sample Diversity (GNMFSD). It aims to learn an effective data representation that preserves both the local geometrical structure and the global discriminant information. The intrinsic geometric structure of the data is exploited by a graph manifold regularization, while the global discriminant information is learned by a linear regression term based on the labeled samples. In particular, we put restrictions on the labeled samples to control the learning of the basis vectors and representations, and sample diversity is utilized on the unlabeled samples to promote effective representation learning. These properties make our approach particularly suitable for data clustering. In addition, we provide the optimization framework as well as the convergence proof for the update rules. Experimental results on three commonly used databases have demonstrated the effectiveness of our approach.

In the future, we will further improve the robustness and accuracy of GNMFSD. Moreover, we will extend the proposed optimization framework to deal with more complex data, e.g., data with noise, corruptions, or missing entries, and investigate its performance under the framework of semi-supervised learning, which has insightful implications for some practical applications.

Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program) under Grant no. 2013CB329404, the National Natural Science Foundation of China under Grant no. 61572393, and the Special Fund for Basic Scientific Research of Central Colleges, Chang'an University, no. 310812171006.

References

Belhumeur, P.N., Hepanha, J.P., Kriegman, D.J., 1997. Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 19 (7), 711–720.
Cai, D., He, X., Han, J., 2005. Document clustering using locality preserving indexing. IEEE Trans. Knowl. Data Eng. 17 (12), 1624–1637.
Cai, D., He, X., Han, J., Huang, T.S., 2011. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33 (8), 1548–1560.
Chen, J., Liu, Y., 2011. Locally linear embedding: a survey. Artif. Intell. Rev. 36 (1), 29–48.
Chen, W., Zhao, Y., Pan, B., Chen, B., 2016. Supervised kernel nonnegative matrix factorization for face recognition. Neurocomputing 205, 165–181.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1–38.
Essid, S., Fevotte, C., 2013. Smooth nonnegative matrix factorization for unsupervised audiovisual document structuring. IEEE Trans. Multimedia 15 (2), 415–425.
Feng, X., Jiao, Y., Lv, C., Zhou, D., 2016. Label consistent semi-supervised non-negative matrix factorization for maintenance activities identification. Eng. Appl. Artif. Intell. 52, 161–167.
Gillis, N., Plemmons, R.J., 2013. Sparse nonnegative matrix underapproximation and its application to hyperspectral image analysis. Linear Algebra Appl. 438 (10), 3991–4007.
He, X., Cai, D., Shao, Y., Bao, H., Han, J., 2011. Laplacian regularized Gaussian mixture model for data clustering. IEEE Trans. Knowl. Data Eng. 23 (9), 1406–1418.
Hyvarinen, A., Oja, E., 2000. Independent component analysis: algorithms and applications. Neural Netw. 13 (4), 411–430.
Jiang, B., Zhao, H., Tang, J., Luo, B., 2014. A sparse nonnegative matrix factorization technique for graph matching problems. Pattern Recognit. 47 (2), 736–747.
Jolliffe, I., 1989. Principal Component Analysis. Springer-Verlag, New York.
Lee, D.D., Seung, H.S., 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791.
Li, P., Bu, J., Yang, Y., Ji, R., Chen, C., Cai, D., 2014. Discriminative orthogonal nonnegative matrix factorization with flexibility for data representation. Expert Syst. Appl. 41 (4), 1283–1293.
Liu, H., Wu, Z., Li, X., Cai, D., Huang, T.S., 2011. Constrained nonnegative matrix factorization for image representation. IEEE Trans. Pattern Anal. Mach. Intell. 34 (7), 1299–1311.
Logothetis, N.K., Sheinberg, D.L., 1996. Visual object recognition. Ann. Rev. Neurosci. 19, 577–621.
Lu, N., Miao, H., 2016. Structure constrained nonnegative matrix factorization for pattern clustering and classification. Neurocomputing 171, 400–411.
Shang, F., Jiao, L., Wang, F., 2012. Graph dual regularization nonnegative matrix factorization for co-clustering. Pattern Recognit. 45 (6), 2237–2250.
Sun, F., Xu, M., Hu, X., Jiang, X., 2016. Graph regularized and sparse nonnegative matrix factorization with hard constraints for data representation. Neurocomputing 173, 233–244.
van Someren, E., Wessels, L., Reinders, M., Backer, E., 2001. Robust genetic network modeling by adding noisy data. In: Proceedings of the IEEE Workshop on Nonlinear Signal and Image Processing.
Wachsmuth, E., Oram, M.W., Perrett, D.I., 1994. Recognition of objects and their component parts: responses of single units in the temporal cortex of the macaque. Cereb. Cortex 4, 509–522.
Wang, J.Y., Bensmail, H., Gao, X., 2013. Multiple graph regularized nonnegative matrix factorization. Pattern Recognit. 46 (10), 2840–2847.
Wang, Y., Zhang, Y., 2013. Nonnegative matrix factorization: a comprehensive review. IEEE Trans. Knowl. Data Eng. 25 (6), 1336–1353.
Xu, Y., Li, Z., Zhang, B., Yang, J., You, J., 2017. Sample diversity, representation effectiveness and robust dictionary learning for face recognition. Inform. Sci. 375, 171–182.
Zhi, R., Flierl, M., Ruan, Q., Kleijn, W.B., 2011. Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition. Syst. Man Cybern. 41 (1), 38–52.
