Graph regularized and sparse nonnegative matrix factorization with hard constraints for data representation


PII: S0925-2312(15)01276-X
DOI: http://dx.doi.org/10.1016/j.neucom.2015.01.103
Reference: NEUCOM16036
To appear in: Neurocomputing
Received 1 July 2014; revised 11 October 2014; accepted 22 January 2015

Graph Regularized and Sparse Nonnegative Matrix Factorization with Hard Constraints for Data Representation Fuming Sun, Meixiang Xu, Xuekao Hu, Xiaojun Jiang Liaoning University of Technology, Jinzhou 121001, P R China E-mail: [email protected]

Abstract: Nonnegative Matrix Factorization (NMF), a popular technique for finding parts-based, linear representations of nonnegative data, has been successfully applied in a wide range of applications because it yields components with physical meaning and interpretability, consistent with the psychological intuition of combining parts to form a whole. For practical classification tasks, however, NMF ignores both the local geometry of the data and the discriminative information of different classes. In addition, existing research results demonstrate that leveraging sparseness can greatly enhance the ability to learn parts. Motivated by these advances, we propose a novel matrix decomposition algorithm, called Graph regularized and Sparse Non-negative Matrix Factorization with hard Constraints (GSNMFC), which seeks a compact representation of the data so that further learning tasks can be facilitated. The proposed GSNMFC jointly incorporates a graph regularizer, hard prior label information, and a sparseness constraint as additional conditions to uncover the intrinsic geometrical and discriminative structures of the data space. The corresponding update rules and convergence proofs for the optimization problem are given in detail. Experimental results demonstrate the effectiveness of our algorithm in comparison with state-of-the-art approaches through a set of evaluations.
Key words: Nonnegative matrix factorization; graph-based regularizer; sparseness constraints; label information

1. Introduction

A fundamental problem in a variety of data analysis tasks is to find an appropriate representation for the given data. The purpose of data representation is to uncover the latent structure of the data effectively so that further learning tasks, such as clustering and classification, can be facilitated. Matrix factorization techniques, as fundamental tools for such data representation, have been receiving more and more attention. Generally speaking, matrix factorization is non-unique, and many different methods have been proposed by incorporating different constraints under different criteria. The canonical techniques include Non-negative Matrix Factorization (NMF) [1, 2], the QR Decomposition (QRD), Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA), Regularized LDA [3], and Deterministic Column-Based Matrix Decomposition [4].

Matrix factorization aims to find two or more matrix factors whose product is a good approximation to the original matrix. In practical applications, the dimension of the decomposed matrix factors is often much smaller than that of the original matrix, leading to a compact representation of the data points that is helpful to other learning tasks such as clustering and classification. Roughly speaking, a matrix decomposition algorithm with strong performance tends to satisfy two basic conditions: (1) it clearly uncovers the intrinsic geometric structures of the data; (2) it reduces the dimensionality of the original data, which in turn facilitates other learning tasks.

The aforementioned PCA and SVD factorize the matrix as a linear combination of principal components. Unlike these methods, NMF [1, 2] enforces the constraint that the elements of the factor matrices must be nonnegative, i.e., all elements must be equal to or greater than zero. This nonnegativity constraint leads NMF to a parts-based representation of the object, in the sense that it only allows additive, not subtractive, combinations of the original data. Therefore, it is an ideal dimensionality reduction algorithm for image processing, face recognition and document clustering, where it is natural to consider the object as a combination of parts forming a whole. Since NMF is an unsupervised learning algorithm, it cannot directly exploit the limited knowledge from domain experts that is available in many real-world problems.

After many years of research and development, plenty of improved methods have been proposed on the basis of NMF. Hoyer [5] and Dueck et al. [6] computed sparse matrix factorizations, in which a sparseness constraint is enforced to enhance the ability of learning parts. Li et al. [7] put forward the Local Nonnegative Matrix Factorization (LNMF) method for learning spatially localized, parts-based subspace representations of visual patterns, giving a set of bases which not only allows a parts-based representation of images but also manifests localized features; however, it has been pointed out that LNMF cannot represent the data very well [8]. Hoyer applied NMF to sparse coding and proposed Nonnegative Sparse Coding (NSC) [9], and further brought up a Sparse NMF (SNMF) algorithm in which the sparseness can be controlled explicitly [10]. Cai et al. [11] presented the Graph Regularized Nonnegative Matrix Factorization (GRNMF) approach, which encodes the geometrical information of the data space by constructing a nearest-neighbor graph to model the local manifold structure. When label information is available, it can be naturally incorporated into the graph structure: if two data points share the same label, a large weight is assigned to the edge connecting them, and if two data points have different labels, the corresponding weight is set to zero.

Taking the label information as additional constraints, Liu et al. [12] proposed the Constrained Nonnegative Matrix Factorization (CNMF). Ding et al. [13] proposed a semi-nonnegative matrix factorization algorithm in which only one matrix factor is restricted to contain nonnegative entries, while the constraint on the basis vectors is relaxed. Yuan et al. [14] proposed Binary Sparse Nonnegative Matrix Factorization, making full use of the sparseness property of the basis vectors to remove easily excluded Haar-like box functions. Due to the effectiveness and importance of data representations, variants of NMF have recently been widely applied in extensive domains, such as document clustering [15, 16], audiovisual document structuring [17], speech and image cryptosystems [18], image classification and annotation [19], blind source separation [20], facial expression recognition [21], and image search reranking [22]. A comprehensive review of the principles, basic models, properties and algorithms of NMF is given in [23]. Moreover, sparseness constraints together with many learning methods have been applied to video search reranking [24] and image categorization [25].

Motivated by recent progress in matrix factorization, we propose a novel NMF method for data representation that exploits three constraint conditions: a graph-based regularizer, a sparseness requirement, and prior label information offered by a few labeled data points. The proposed method is referred to as Graph regularized and Sparse Nonnegative Matrix Factorization with hard Constraints (GSNMFC) and represents the data in a more reasonable way. In this method, we incorporate hard prior label information into the graph to encode the intrinsic geometrical and discriminative structures of the data space, and also take the label information as additional constraints on the matrix decomposition. Furthermore, we exploit extra sparseness constraints to make the coefficients much sparser, under the assumption that improving the sparseness levels of the factors enhances the ability of learning parts. By combining the three constraints, we expect that further learning performance, such as recognition rate and clustering results, can be improved in the new data representation. We also prove the convergence of the proposed method. Finally, we carry out extensive experiments on common face databases to validate the effectiveness and efficiency of the novel matrix factorization method proposed in this paper.

The main contributions of our work can be summarized as follows. (1) The proposed method possesses the merit of CNMF, which takes the label information as additional hard constraints and is parameter free. On the other hand, it has the advantage of GRNMF, which exploits the intrinsic geometric structure of the data distribution and incorporates it as an extra regularization term. In a nutshell, the algorithm presented here combines the virtues of the two methods mentioned above. Moreover, the proposed method

achieves better performance in terms of clustering accuracy and normalized mutual information. (2) Since improving the sparseness levels of the factors enhances the discrimination ability of the learned parts, sparseness is an indispensable constraint for NMF rather than an optional one. Therefore, on the basis of GRNMF and CNMF, we enforce an additional sparseness constraint, which gives rise to a much sparser representation that is conducive to other learning tasks such as classification and clustering.

The rest of the paper is organized as follows. First, we give brief reviews of related methods, including NMF, CNMF and GNMF, in Section 2. Next, in Section 3, our proposed algorithm is described in detail and a theoretical proof of convergence of our optimization scheme is provided. Section 4 then reports extensive experimental results and corresponding analyses on three popular datasets, followed by concluding remarks in Section 5.

2. Brief Reviews

2.1. NMF

Given a data matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, where each $x_i \in \mathbb{R}^m$ is a sample vector with nonnegative elements, the goal of NMF is to seek two nonnegative matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ (with $k < mn/(m+n)$) whose product $UV^T$ approximates $X$ as closely as possible, i.e., to minimize the following objective function:

$$O_F = \| X - UV^T \|^2, \qquad \text{s.t. } U \geq 0, \; V \geq 0 \tag{1}$$

where $\| \cdot \|$ denotes the Frobenius norm. Equation (1) formulates the objective from the Euclidean-distance perspective; an alternative formulation from the point of view of divergence is

$$O_{KL} = D(X \,\|\, UV^T) = \sum_{i,j} \left( x_{ij} \log \frac{x_{ij}}{(UV^T)_{ij}} - x_{ij} + (UV^T)_{ij} \right) \tag{2}$$

It can be proven that the iterations minimizing (1) and (2) are both convergent. For (1), the multiplicative iterative updating formulas [1, 2] are

$$u_{ik} \leftarrow u_{ik} \frac{(XV)_{ik}}{(UV^T V)_{ik}} \tag{3}$$

$$v_{jk} \leftarrow v_{jk} \frac{(X^T U)_{jk}}{(V U^T U)_{jk}} \tag{4}$$

where $U = [u_{ik}]$ and $V = [v_{jk}]$. At the very beginning of the iterative process, the two nonnegative matrices $U_0$ and $V_0$ are initialized at random. The updates (3) and (4) are then executed repeatedly until a given termination condition is met, yielding the final $U$ and $V$.

2.2. CNMF

NMF is an unsupervised method for matrix decomposition and does not take the label information of the data into account, so it cannot be used directly when label information is available. CNMF [12] considers the label information as additional constraints, which guarantees that data points sharing the same label are projected into the same class in the low-dimensional space. Assume that there are $n$ nonnegative data points $\{x_i\}_{i=1}^n$ belonging to $c$ possible classes. Each data point among $x_1, x_2, \ldots, x_l$ is labeled with one class, while the remaining $n - l$ data points are unlabeled. An $l \times c$ indicator matrix $C$ is first defined as

$$c_{i,j} = \begin{cases} 1, & \text{if } x_i \text{ is labeled with the } j\text{-th class} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

With the indicator matrix $C$, a label constraint matrix $A$ is defined as

$$A = \begin{pmatrix} C_{l \times c} & 0 \\ 0 & I_{n-l} \end{pmatrix} \tag{6}$$

where $I_{n-l}$ is an $(n-l) \times (n-l)$ identity matrix. With the label constraint matrix $A$, the factorization in NMF is reparameterized as $X \approx UV^T = U(AZ)^T$, i.e., the coefficient matrix is restricted to the form $V = AZ$. It is easy to see that if $x_i$ and $x_j$ share the same label, their coefficient vectors are identical. The objective function of CNMF with the label constraints is therefore

$$O_F = \| X - UZ^T A^T \|^2, \qquad \text{s.t. } U \geq 0, \; Z \geq 0 \tag{7}$$

The updating rules to solve (7) are

$$u_{ik} \leftarrow u_{ik} \frac{(XAZ)_{ik}}{(UZ^T A^T A Z)_{ik}} \tag{8}$$

$$z_{ij} \leftarrow z_{ij} \frac{(A^T X^T U)_{ij}}{(A^T A Z U^T U)_{ij}} \tag{9}$$
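To make the construction of the constraint matrices in (5) and (6) concrete, the following is a minimal NumPy sketch (not taken from the authors' code). It assumes that the labeled samples come first and that their labels are integers in {0, ..., c-1}; the function name is illustrative.

```python
import numpy as np

def label_constraint_matrix(labels_l, n):
    """Build the label constraint matrix A of Eq. (6).

    labels_l : length-l array of class indices (0..c-1) for the first l
               (labeled) samples; the remaining n - l samples are unlabeled.
    n        : total number of samples.
    Returns A with shape (n, c + n - l).
    """
    labels_l = np.asarray(labels_l)
    l, c = len(labels_l), labels_l.max() + 1
    # Indicator matrix C (Eq. 5): C[i, j] = 1 iff sample i carries label j.
    C = np.zeros((l, c))
    C[np.arange(l), labels_l] = 1.0
    # Block-diagonal A = [[C, 0], [0, I_{n-l}]] (Eq. 6).
    A = np.zeros((n, c + n - l))
    A[:l, :c] = C
    A[l:, c:] = np.eye(n - l)
    return A

# Example: 6 samples, the first 3 labeled with classes 0, 1, 1.
A = label_constraint_matrix([0, 1, 1], n=6)
print(A.shape)  # (6, 5): c = 2 label columns plus an identity block for 3 unlabeled samples
```

Because two labeled samples with the same class share an identical row of $A$, they necessarily share the same row of $V = AZ$, i.e., the same coefficient vector, which is exactly the hard constraint described above.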

2.3. GNMF

When factorizing a matrix with the GNMF algorithm, the decomposed data are required to preserve the inherent structure of the original data: if two data points $x_i$ and $x_j$ are close to each other in the intrinsic geometry of the original data distribution, they should remain adjacent in the new data space after factorization. In other words, on the basis of NMF, GNMF makes the decomposed data points maintain the manifold characteristics of the original data graph. Let $G$ denote the graph built on the original data points, with the weight matrix $W$ defined as

$$W_{ij} = \begin{cases} e^{-\frac{\| x_i - x_j \|^2}{\sigma}}, & \text{if } x_i \text{ and } x_j \text{ are adjacent} \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

where $\sigma$ is a trade-off (length scale) parameter. Let $D$ be the diagonal matrix with $D_{ii} = \sum_j W_{ij}$, and let $L = D - W$ be the graph Laplacian. The objective function of GNMF is

$$O_F = \| X - UV^T \|^2 + \lambda \, \mathrm{Tr}(V^T L V), \qquad \text{s.t. } U \geq 0, \; V \geq 0 \tag{11}$$

where $\lambda > 0$ is a regularization constant. The multiplicative update rules to solve (11) are

$$u_{ik} \leftarrow u_{ik} \frac{(XV)_{ik}}{(UV^T V)_{ik}} \tag{12}$$

$$v_{jk} \leftarrow v_{jk} \frac{(X^T U + \lambda W V)_{jk}}{(V U^T U + \lambda D V)_{jk}} \tag{13}$$

Cai et al. [11] proved that the objective function is non-increasing under these two update rules. In this way, manifold learning theory is combined with NMF, which further improves performance.
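As an illustration of how the graph quantities in (10) and the Laplacian $L = D - W$ can be assembled, below is a minimal NumPy sketch. It assumes a p-nearest-neighbor notion of adjacency with symmetrization (the text only says "adjacent", so this neighbor rule is our assumption), and the function name and defaults are illustrative.

```python
import numpy as np

def heat_kernel_graph(X, n_neighbors=5, sigma=0.4):
    """Weight matrix W (Eq. 10), degree matrix D and Laplacian L = D - W.

    X is (n_samples, n_features); two points are treated as adjacent when
    one is among the other's n_neighbors nearest neighbors.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(dist2, np.inf)            # exclude self-loops
    W = np.zeros((n, n))
    knn = np.argsort(dist2, axis=1)[:, :n_neighbors]
    for i in range(n):
        W[i, knn[i]] = np.exp(-dist2[i, knn[i]] / sigma)
    W = np.maximum(W, W.T)                     # symmetrize the adjacency
    D = np.diag(W.sum(axis=1))                 # D_ii = sum_j W_ij
    L = D - W                                  # graph Laplacian
    return W, D, L
```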

3. The Proposed Algorithm

3.1. The Updating Algorithm

By exploiting the nonnegativity constraints, NMF can learn a parts-based representation. However, it performs this learning in Euclidean space and fails to reveal the intrinsic geometrical and discriminative structure of the data space, which is essential for practical applications. As for GNMF, its major disadvantage is that there is no theoretical guarantee that data points from the same class will be projected together in the new representation space, and how to choose the weights in a principled way remains unclear. The two refined approaches above, i.e., GNMF and CNMF, thus each have their merits and demerits. In this section we propose a novel method called Graph regularized and Sparse NMF with hard Constraints (GSNMFC), which embodies three conditions:

graph regularization, hard constraints from prior label information, and a sparseness constraint. With this method, both the label information and the geometric structure of the data are preserved. In addition, the representation of the data is much sparser, which can enhance the recognition accuracy and improve the effectiveness of the proposed algorithm. GSNMFC minimizes the following objective function:

$$O_F = \| X - UZ^T A^T \|^2 + \lambda \, \mathrm{Tr}(Z^T A^T L A Z) + \gamma \| U \|_F^2, \qquad \text{s.t. } U \geq 0, \; Z \geq 0 \tag{14}$$

where $A$ is the label constraint matrix, $L = D - W$ is the graph Laplacian, $D$ is the diagonal matrix with $D_{ii} = \sum_j W_{ij}$, $W$ is the weight matrix, $\lambda$ is the regularization parameter, $\gamma \in (0, 1)$ is the sparseness coefficient, and $\| \cdot \|_F$ is the Frobenius norm.

Notice that if the sparseness levels of the factors are improved, the ability of learning parts can be enhanced; in this sense, sparseness is an indispensable constraint for NMF rather than an optional one. Thus, on top of GNMF and CNMF, we add an additional sparseness constraint (the third term of the objective function in (14)), which gives rise to a much sparser representation that is conducive to other learning tasks such as classification and clustering. A natural measure of sparseness is the L0 norm; however, optimization under the L0 norm has been proved NP-hard and intractable, so the L1 norm is frequently used instead to impose sparseness constraints [26, 27]. Although the L1-norm-based sparseness measure has achieved great success [29], it has two disadvantages: (1) it implicitly requires that the target signals are of high sparseness levels [28], otherwise it cannot measure the sparseness appropriately; (2) it is simply ineffective in certain situations. For these reasons, we adopt the Frobenius norm here.

Since the GSNMFC objective is non-convex, it is very hard to find the global optimal solution. Following the steepest descent method, we derive multiplicative iterative rules that find a local optimal solution. The objective function in (14) can be rewritten as

$$\begin{aligned} O_F &= \mathrm{Tr}\big( (X - UZ^T A^T)(X - UZ^T A^T)^T \big) + \lambda \, \mathrm{Tr}(Z^T A^T L A Z) + \gamma \| U \|_F^2 \\ &= \mathrm{Tr}(XX^T) - 2\,\mathrm{Tr}(XAZU^T) + \mathrm{Tr}(UZ^T A^T A Z U^T) + \lambda \, \mathrm{Tr}(Z^T A^T L A Z) + \gamma \| U \|_F^2, \qquad \text{s.t. } U \geq 0, \; Z \geq 0 \end{aligned}$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix and all matrices in the equation are nonnegative. Define $\Psi = [\psi_{ik}]$ and $\Phi = [\phi_{kj}]$ as Lagrange multipliers; the Lagrange function is

$$\mathcal{L} = O_F + \mathrm{Tr}(\Psi U^T) + \mathrm{Tr}(\Phi Z^T) \tag{15}$$

Requiring that the derivatives of $\mathcal{L}$ with respect to $U$ and $Z$ vanish, we obtain

$$\frac{\partial \mathcal{L}}{\partial U} = -2XAZ + 2UZ^T A^T A Z + 2\gamma U + \Psi = 0 \tag{16}$$

$$\frac{\partial \mathcal{L}}{\partial Z} = -2A^T X^T U + 2A^T A Z U^T U + 2\lambda A^T L A Z + \Phi = 0 \tag{17}$$

Using the Karush-Kuhn-Tucker conditions $\psi_{ik} u_{ik} = 0$ and $\phi_{kj} z_{kj} = 0$, we get the following equations:

$$-(XAZ)_{ik}\, u_{ik} + (UZ^T A^T A Z)_{ik}\, u_{ik} + \gamma\, (U)_{ik}\, u_{ik} = 0$$

$$-(A^T X^T U)_{kj}\, z_{kj} + (A^T A Z U^T U)_{kj}\, z_{kj} + \lambda\, (A^T L A Z)_{kj}\, z_{kj} = 0$$

From these equations, the two updating rules are obtained as

$$u_{ik} \leftarrow u_{ik} \frac{(XAZ)_{ik}}{(UZ^T A^T A Z + \gamma U)_{ik}} \tag{18}$$

$$z_{kj} \leftarrow z_{kj} \frac{(A^T X^T U + \lambda A^T W A Z)_{kj}}{(A^T A Z U^T U + \lambda A^T D A Z)_{kj}} \tag{19}$$
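To make the optimization concrete, here is a minimal NumPy sketch of the multiplicative updates (18) and (19), which also records the value of the objective (14) at each iteration. It is an illustrative reimplementation under our own naming, not the authors' code; the random initialization, the default parameter values, and the small constant eps guarding the denominators are our choices.

```python
import numpy as np

def gsnmfc(X, A, L, D, W, k, lam=100.0, gamma=0.3, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates (18)-(19) for the GSNMFC objective (14).

    X : (m, n) nonnegative data matrix, columns are samples.
    A : (n, q) label constraint matrix from Eq. (6), q = c + n - l.
    L, D, W : graph Laplacian, degree and weight matrices (L = D - W).
    Returns U (m, k), Z (q, k) with X ~ U Z^T A^T, and the objective history.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    q = A.shape[1]
    U = rng.random((m, k))
    Z = rng.random((q, k))
    obj = []
    for _ in range(n_iter):
        # Eq. (18): U update.
        XAZ = X @ A @ Z
        UZAAZ = U @ (Z.T @ A.T @ A @ Z)
        U *= XAZ / (UZAAZ + gamma * U + eps)
        # Eq. (19): Z update (L = D - W split between numerator and denominator).
        num = A.T @ X.T @ U + lam * (A.T @ W @ A @ Z)
        den = A.T @ A @ Z @ (U.T @ U) + lam * (A.T @ D @ A @ Z) + eps
        Z *= num / den
        # Objective (14), useful for checking the monotonic decrease.
        R = X - U @ Z.T @ A.T
        obj.append(np.sum(R ** 2)
                   + lam * np.trace(Z.T @ A.T @ L @ A @ Z)
                   + gamma * np.sum(U ** 2))
    return U, Z, obj
```

After convergence, the low-dimensional representation of the samples is $V = AZ$ (one row per sample), on which clustering or classification can then be performed.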

3.2. Proof of Convergence of GSNMFC

The objective function in (14) is non-increasing under the updating rules (18) and (19). To show this, we first introduce the notion of an auxiliary function.

Lemma 1. $G(x, x')$ is an auxiliary function for $F(x)$ if it satisfies the conditions $G(x, x') \geq F(x)$ and $G(x, x) = F(x)$. If $G$ is an auxiliary function of $F$, then $F$ is non-increasing under the updating rule

$$x^{t+1} = \arg\min_x G(x, x^t) \tag{20}$$

From Lemma 1, $F(x^{t+1}) \leq G(x^{t+1}, x^t) \leq G(x^t, x^t) = F(x^t)$, and the equality $F(x^{t+1}) = F(x^t)$ holds only if $x^t$ is a local minimum of $G(x, x^t)$. If the derivatives of $F$ exist and are continuous in a small neighborhood of $x^t$, this also implies $\nabla F(x^t) = 0$. Thus, by iterating the update in (20), we obtain a sequence of estimates that converges to a local minimum $x_{\min} = \arg\min_x F(x)$ of the objective function:

$$F(x_{\min}) \leq \cdots \leq F(x^{t+1}) \leq F(x^t) \leq \cdots \leq F(x^1) \leq F(x^0)$$

Therefore, by defining an appropriate auxiliary function $G(x, x^t)$, we can show that the updating rule for the objective function in (14) takes the form $x^{t+1} = \arg\min_x G(x, x^t)$.

Lemma 2. Let $F_{ij}$ denote the part of the objective function in (14) that depends on $z_{ij}$, and let $F'_{ij}$ denote its first-order derivative with respect to $z_{ij}$. Then

$$G(z, z_{ij}^t) = F_{ij}(z_{ij}^t) + F'_{ij}(z_{ij}^t)(z - z_{ij}^t) + \frac{(A^T A Z U^T U)_{ij} + \lambda (A^T D A Z)_{ij}}{z_{ij}^t} (z - z_{ij}^t)^2 \tag{21}$$

is an auxiliary function with respect to $z_{ij}$ of the objective function in (14).

Proof. Obviously, $G(z, z) = F_{ij}(z)$. According to the definition of an auxiliary function, we only need to show that $G(z, z_{ij}^t) \geq F_{ij}(z)$. To do this, we take the Taylor series expansion of $F_{ij}(z)$:

$$F_{ij}(z) = F_{ij}(z_{ij}^t) + F'_{ij}(z_{ij}^t)(z - z_{ij}^t) + \big[ (A^T A)_{ii} (U^T U)_{jj} + \lambda (A^T L A)_{ii} \big] (z - z_{ij}^t)^2 \tag{22}$$

Comparing (21) with (22), showing $G(z, z_{ij}^t) \geq F_{ij}(z)$ is equivalent to proving

$$\frac{(A^T A Z U^T U)_{ij} + \lambda (A^T D A Z)_{ij}}{z_{ij}^t} \geq (A^T A)_{ii} (U^T U)_{jj} + \lambda (A^T L A)_{ii}$$

which follows from the two inequalities

$$\frac{(A^T A Z U^T U)_{ij}}{z_{ij}^t} \geq (A^T A)_{ii} (U^T U)_{jj} \tag{23}$$

$$\frac{\lambda (A^T D A Z)_{ij}}{z_{ij}^t} \geq \lambda (A^T L A)_{ii} \tag{24}$$

Indeed,

$$(A^T A Z U^T U)_{ij} = \sum_l (A^T A Z)_{il} (U^T U)_{lj} \geq (A^T A Z)_{ij} (U^T U)_{jj} \geq (A^T A)_{ii}\, z_{ij}^t\, (U^T U)_{jj}$$

and

$$\lambda (A^T D A Z)_{ij} = \lambda \sum_l (A^T D A)_{il}\, z_{lj}^t \geq \lambda (A^T D A)_{ii}\, z_{ij}^t \geq \lambda \big( A^T (D - W) A \big)_{ii}\, z_{ij}^t = \lambda (A^T L A)_{ii}\, z_{ij}^t$$

so both (23) and (24) hold and therefore $G(z, z_{ij}^t) \geq F_{ij}(z)$.

Similarly, it can be shown that the objective function is non-increasing under the updating rule for $u$. Therefore, the objective function in (14) is non-increasing under the updating rules (18) and (19).

4. Experimental Results

In this section, we investigate the use of the proposed GSNMFC algorithm for image clustering on three widely used benchmark databases and compare it against four state-of-the-art algorithms. Extensive experimental results show the efficiency and validity of the proposed approach for data clustering in the analysis of human face images.

4.1 Databases

To evaluate the proposed method, we carried out experiments on three common databases: PIE_pose27, ORL_32 and Yale_32. Each of these databases contains a variety of categories of human face images; Table 1 gives their statistics in terms of size, dimensionality and number of classes.

(1) PIE_pose27. The PIE_pose27 database [31] consists of 2856 images of 68 people, each person under 13 different poses, 43 different lighting conditions, and with 4 different expressions.

(2) ORL_32. The ORL_32 database [32] contains ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

(3) Yale_32. The Yale database [33] contains 165 grayscale images of 15 individuals. There are 11 images per subject, one per facial expression or configuration: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. We apply the same preprocessing to this dataset as to the ORL dataset, so each image is represented by a 1024-dimensional vector in image space.

Table 1 Statistics of the datasets

Dataset        Size    Dimensionality    Classes
ORL_32         400     1024              40
PIE_pose27     2856    1024              68
Yale_32        165     1024              15

In all the experiments, images from the databases are preprocessed in advance so that the faces are located. The original images are first normalized in scale and orientation so that the two eyes are aligned at the same position, and the facial areas are then cropped into the final images for clustering. Each image is 32×32 pixels with 256 gray levels per pixel. Some example images from the three databases are displayed in Fig. 1.

Fig. 1 Example images from the three databases: (a) ORL_32, (b) PIE_pose27, (c) Yale_32

4.2 Evaluation Criteria

In our experiments, the clustering performance is evaluated under two criteria [12, 30]: the accuracy (AC) and the normalized mutual information (NMI). They are defined as follows.

AC measures the percentage of correctly labeled samples. Given a data set containing $n$ images, for each image sample let $l_i$ be the cluster label obtained by the algorithm under evaluation and $r_i$ be the ground-truth label provided by the data set. Then AC is defined as

$$AC = \frac{\sum_{i=1}^{n} \delta\big(r_i, \mathrm{map}(l_i)\big)}{n}$$

where $\delta(x, y)$ is the delta function that equals one if $x = y$ and zero otherwise, and $\mathrm{map}(l_i)$ is the mapping function that maps each cluster label $l_i$ to the equivalent label from the data set. The best mapping can be found by the Kuhn-Munkres algorithm.

In many clustering applications, mutual information is used to measure how similar two sets of clusters are. Given two sets of image clusters $C$ and $C'$, their mutual information $MI(C, C')$ is defined as

$$MI(C, C') = \sum_{c_i \in C, \, c'_j \in C'} p(c_i, c'_j) \log \frac{p(c_i, c'_j)}{p(c_i)\, p(c'_j)}$$

where $p(c_i)$ and $p(c'_j)$ denote the probabilities that an image arbitrarily selected from the data set belongs to cluster $c_i$ and $c'_j$, respectively, and $p(c_i, c'_j)$ is the joint probability that the arbitrarily selected image belongs to both $c_i$ and $c'_j$ at the same time. $MI(C, C')$ takes values between zero and $\max(H(C), H(C'))$, where $H(C)$ and $H(C')$ are the entropies of $C$ and $C'$, respectively. It reaches the maximum $\max(H(C), H(C'))$ when the two sets of image clusters are identical, and it becomes zero when the two sets are completely independent. An important property of $MI(C, C')$ is that its value is invariant under permutations of the cluster labels. In our experiments, we use the normalized metric $MI_n(C, C')$, which takes values between zero and one:

$$MI_n(C, C') = \frac{MI(C, C')}{\max\big(H(C), H(C')\big)} \in [0, 1]$$

4.3 Parameter Selection

In the proposed GSNMFC there are three critical trade-off parameters: the regularization parameter $\lambda$, the sparseness coefficient $\gamma$, and the length scale parameter $\sigma$ of the heat kernel. Selecting all of these parameters by grid search is time-consuming. Relatively speaking, $\lambda$, $\gamma$ and $\sigma$ affect the performance only slightly if they are set within feasible ranges, which can be obtained with the following procedure. Setting the range of $\lambda$ to $[1, 10^4]$, Fig. 2 shows how the clustering accuracy and normalized mutual information behave on the Yale, ORL and PIE datasets; clearly, AC and NMI vary with different $\lambda$ values. Likewise, setting the range of $\gamma$ to $[0.1, 0.9]$, Fig. 3 shows the corresponding behavior on the three datasets, and again AC and NMI vary with different $\gamma$ values. To the best of our knowledge, there are three commonly used ways of defining the weight matrix $W$ of the neighborhood graph. Empirically, for image data the heat kernel weighting is a reasonable scheme; since $W_{ij}$ only measures the closeness between two data points, we do not treat the different weighting schemes separately here. Automatically selecting $\sigma$ in the heat kernel weighting is a challenging problem that has received a lot of interest in recent studies; setting the range of $\sigma$ to $[0.1, 1]$, Fig. 4 shows the performance of GSNMFC as a function of $\sigma$, and we simply take one suitable value. A more detailed analysis of this subject is beyond the scope of this paper, and interested readers can refer to related works for more details.

Based on the above, striking a balance between AC and NMI, we set $\lambda = 100$, $\gamma = 0.3$, $\sigma = 0.4$ for the experiments on ORL; $\lambda = 150$, $\gamma = 0.5$, $\sigma = 0.3$ for the experiments on PIE; and $\lambda = 100$, $\gamma = 0.4$, $\sigma = 0.3$ for the experiments on Yale. Last but not least, in all the experiments, based on repeated trials and on others' published results, the number of nearest neighbors is set to 5 empirically.

Fig. 2 The performance of GSNMFC versus the parameter $\lambda$ on the three datasets: (a) AC, (b) NMI

Fig. 3 The performance of GSNMFC versus the parameter $\gamma$ on the three datasets: (a) AC, (b) NMI

Fig. 4 The performance of GSNMFC versus the parameter $\sigma$ on the three datasets: (a) AC, (b) NMI

4.4 Convergence Study

We use iterative update rules to obtain local optima of the GSNMFC objective function, whether the cost is measured by the Frobenius norm or by the KL divergence. The convergence of our update rules was proved in Section 3.2. Next, using the parameter values chosen in Section 4.3, we show the convergence speed of our approach experimentally. Fig. 5 displays the convergence curves of the proposed method and of the original NMF algorithm on the three databases; in each plot, the x-axis is the number of iterations and the y-axis is the value of the objective function. From the convergence curves it can be seen that the multiplicative rules of the proposed approach converge faster than the original NMF. In particular, GSNMFC typically converges within 100 iterations on all three databases, while the original NMF needs a different number of iterations on each: within 500 iterations on ORL_32, within 200 iterations on PIE_pose27, and within 100 iterations on Yale_32. This also suggests that the proposed method is more stable than NMF in terms of convergence.

Fig. 5 The convergence curves of NMF and GSNMFC on the three benchmark databases: (a) NMF on ORL_32, (b) GSNMFC on ORL_32, (c) NMF on PIE_pose27, (d) GSNMFC on PIE_pose27, (e) NMF on Yale_32, (f) GSNMFC on Yale_32

4.5 Clustering Results

After dimensionality reduction, clustering can be carried out in the low-dimensional space. To demonstrate how the clustering performance is improved by the proposed algorithm, we compare the following five approaches on the three databases:

- Our proposed Graph regularized and Sparse NMF with hard Constraints (GSNMFC).
- Graph regularized Non-negative Matrix Factorization (GNMF) [11], which encodes the geometrical information of the data space into the matrix factorization.
- Nonnegative Matrix Factorization based clustering (NMF).
- Constrained Nonnegative Matrix Factorization (CNMF) [12], minimizing the F-norm cost.
- Concept Factorization based clustering (CF) [30].

We evaluate the clustering performance of the different methods on the three databases. In each experiment, we randomly choose $k$ ($k = 2, 3, \ldots, 10$) categories from the dataset as the collection X for clustering. We randomly pick ten percent of the images from each category in X; for the ORL database, where there are only 10 images per category, we pick twenty percent. For a fixed cluster number $k$, the experiment is repeated twenty times and the average result is recorded as the final result. Tables 2, 4 and 6 report the detailed clustering accuracy of the five methods on ORL_32, PIE_pose27 and Yale_32, respectively, while Tables 3, 5 and 7 report the corresponding normalized mutual information. Compared with the best of the other four algorithms, our method achieves a 2.8 percent improvement in accuracy on average and a 3.42 percent improvement in normalized mutual information on average.
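The per-trial sampling just described can be written down as the following NumPy sketch. It is illustrative only (the function name is ours), and it reads the 10%/20% sampling as selecting the labeled subset of each chosen category that feeds the hard constraints in (6), which is our interpretation of the protocol.

```python
import numpy as np

def sample_trial(y, k, labeled_frac=0.10, rng=None):
    """Draw one experimental trial: k random categories, all of their images,
    and a labeled_frac fraction of each category marked as labeled.

    y is the vector of ground-truth class indices for the whole database.
    Returns (idx, labeled): idx indexes the selected images, and labeled is a
    boolean mask over idx marking the samples whose labels are revealed to the
    algorithm (used to build A in Eq. (6)) -- an assumption on our part.
    """
    rng = np.random.default_rng(rng)
    chosen = rng.choice(np.unique(y), size=k, replace=False)
    idx, labeled = [], []
    for c in chosen:
        members = rng.permutation(np.flatnonzero(y == c))
        n_lab = max(1, int(round(labeled_frac * len(members))))
        idx.extend(members.tolist())
        labeled.extend([True] * n_lab + [False] * (len(members) - n_lab))
    return np.array(idx), np.array(labeled)

# For each fixed k the trial is repeated twenty times and AC / NMI are averaged.
```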

Table 2 The clustering accuracy (AC, %) on ORL_32

k      CF      NMF     CNMF    GNMF    GSNMFC
2      87.96   90.23   98.00   98.01   98.32
3      86.91   87.99   93.33   85.00   94.41
4      77.00   85.79   90.00   86.12   90.06
5      76.40   80.20   86.00   85.81   86.59
6      73.76   79.50   85.00   84.37   84.58
7      74.89   76.86   72.86   80.86   81.81
8      77.26   72.75   71.25   73.38   80.73
9      75.33   75.01   68.89   69.33   71.33
10     74.02   71.00   67.00   72.40   72.60
Avg.   78.17   79.92   81.37   81.69   84.49

Moreover, Fig. 6, Fig. 7 and Fig. 8 show the graphical clustering results for ORL_32, PIE_pose27 and Yale_32, respectively, under different values of $k$. In detail, Fig. 6 plots the accuracy and normalized mutual information curves on ORL_32, from which we can see that GSNMFC achieves decent performance, although CNMF and GNMF sometimes perform better than our algorithm. Fig. 7 shows that on the PIE_pose27 dataset our algorithm outperforms the other four algorithms CF, NMF, CNMF and GNMF. Finally, Fig. 8 shows that on Yale_32 GSNMFC is superior to the other four algorithms, with CNMF second to GSNMFC in performance. To sum up, the proposed approach has an advantage over the other algorithms on all three datasets.

Fig. 6 The clustering performance on ORL_32: (a) AC, (b) NMI

Fig. 7 The clustering performance on PIE_pose27: (a) AC, (b) NMI

Fig. 8 The clustering performance on Yale_32: (a) AC, (b) NMI

Table 3 The normalized mutual information (NMI, %) on ORL_32

k      CF      NMF     CNMF    GNMF    GSNMFC
2      68.89   80.34   84.23   86.45   90.22
3      62.02   81.01   79.39   78.42   85.69
4      71.33   80.77   88.09   86.82   87.21
5      74.14   75.89   83.92   85.46   88.02
6      75.71   76.90   81.58   82.83   80.99
7      72.99   77.34   78.06   80.00   83.06
8      70.58   73.71   75.90   69.17   76.37
9      71.02   75.58   76.94   75.60   79.21
10     73.05   74.72   78.33   78.81   80.01
Avg.   71.08   77.36   80.72   80.39   83.42

Table 4 The clustering accuracy (AC, %) on PIE_pose27

k      CF      NMF     CNMF    GNMF    GSNMFC
2      60.72   65.31   67.33   67.42   70.01
3      58.02   63.32   65.57   66.59   67.42
4      54.89   62.32   63.88   65.88   65.83
5      50.00   60.95   62.29   63.24   64.49
6      49.87   58.40   60.25   63.00   62.93
7      49.07   57.89   59.32   62.19   63.13
8      47.18   57.05   58.51   60.76   62.08
9      47.36   56.55   57.65   60.01   61.26
10     46.92   56.29   57.27   58.68   60.77
Avg.   51.56   59.79   61.34   63.09   64.21

Table 5 The normalized mutual information (NMI, %) on PIE_pose27

k      CF      NMF     CNMF    GNMF    GSNMFC
2      50.26   50.34   62.48   64.00   65.01
3      46.03   48.00   59.07   62.17   63.26
4      40.16   50.69   58.77   61.89   61.11
5      42.44   51.61   56.92   62.30   63.72
6      41.84   52.50   60.98   58.29   61.41
7      43.36   54.50   59.32   60.54   60.95
8      45.29   53.33   52.16   54.02   56.85
9      47.76   50.58   53.90   55.46   58.29
10     46.38   53.61   56.84   57.46   58.75
Avg.   44.84   51.68   57.83   59.57   61.04

Table 6 The clustering accuracy (AC, %) on Yale_32

k      CF      NMF     CNMF    GNMF    GSNMFC
2      86.24   90.18   93.53   93.01   95.32
3      82.12   87.73   87.72   86.64   90.15
4      80.31   83.42   86.39   87.48   88.37
5      75.70   80.91   83.57   80.05   86.17
6      71.15   78.63   77.18   81.72   83.64
7      68.48   72.58   72.61   77.70   80.00
8      63.51   70.82   73.43   72.61   75.22
9      62.85   69.61   71.95   70.68   73.40
10     58.47   67.05   69.55   68.82   70.75
Avg.   72.09   77.88   79.55   79.86   82.56

Table 7 The normalized mutual information (NMI, %) on Yale_32

k      CF      NMF     CNMF    GNMF    GSNMFC
2      65.78   75.21   76.72   76.85   78.82
3      70.21   71.45   73.58   74.51   75.34
4      63.25   68.15   70.61   74.05   76.24
5      65.31   65.37   67.36   69.38   71.80
6      64.02   66.72   69.16   65.48   70.83
7      62.54   69.30   70.62   67.34   72.31
8      60.86   70.61   68.45   69.21   70.92
9      58.10   68.85   70.66   70.35   72.09
10     62.00   65.91   71.81   71.64   71.97
Avg.   63.56   69.06   71.00   70.98   73.37

4.6 Sparseness Study

The concept of "sparse coding" refers to a representational scheme in which only a few units (out of a large population) are effectively used to represent typical data vectors; in practice, this implies that most units take values close to zero while only a few take significantly non-zero values. Numerous sparseness measures have been proposed in the literature. Such measures are mappings from $\mathbb{R}^n$ to $\mathbb{R}$ that quantify how much of the energy of a vector is packed into only a few components. On a normalized scale, the sparsest possible vector (only a single component is non-zero) should have a sparseness of one, whereas a vector with all elements equal should have a sparseness of zero. In this paper, we use a sparseness measure based on the relationship between the L1 norm and the L2 norm:

$$\mathrm{sparseness}(x) = \frac{\sqrt{n} - \dfrac{\| x \|_1}{\| x \|_2}}{\sqrt{n} - 1} \tag{21}$$

where $n$ is the dimensionality of $x$. This measure evaluates to unity if and only if $x$ contains only a single non-zero component, and takes a value of zero if and only if all components are equal (up to signs), interpolating smoothly between the two extremes.

Next, we compute the sparseness of the basis vectors learned by NMF and GSNMFC according to (21) on the three databases. Fig. 9 shows the basis vectors obtained by GSNMFC and NMF on ORL_32, Fig. 10 shows those learned on PIE_pose27, and Fig. 11 displays those obtained on Yale_32. Each basis vector has dimensionality 1024 and unit norm, and is plotted as a 32×32 gray-scale image in Figs. 9-11. From these three figures it is obvious that the basis vectors learned by GSNMFC are sparser than those learned by NMF, which suggests that GSNMFC can learn a better parts-based representation than NMF.
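The sparseness measure in (21) is straightforward to compute; a minimal NumPy sketch follows (the function name is ours).

```python
import numpy as np

def hoyer_sparseness(x):
    """Sparseness of a vector based on the L1/L2 norm ratio, Eq. (21).

    Returns 1.0 for a vector with a single non-zero component and 0.0 when
    all components have equal magnitude.
    """
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

# The single values reported in Figs. 9-11 are presumably averages of this
# measure over the basis vectors (columns of U) -- an assumption on our part.
print(hoyer_sparseness([0.0, 0.0, 1.0]))   # 1.0
print(hoyer_sparseness([1.0, 1.0, 1.0]))   # 0.0
```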

Fig. 9 Basis images on ORL_32: (a) NMF (sparseness = 0.4882), (b) GSNMFC (sparseness = 0.5342)

Fig. 10 Basis images on PIE_pose27: (a) NMF (sparseness = 0.6635), (b) GSNMFC (sparseness = 0.6963)

Fig. 11 Basis images on Yale_32: (a) NMF (sparseness = 0.5037), (b) GSNMFC (sparseness = 0.5421)

5. Conclusions

Motivated by recent progress in matrix factorization and manifold learning, a novel Graph regularized and Sparse NMF with hard Constraints (GSNMFC) is proposed for data representation. It is verified to yield significant improvements over four existing state-of-the-art techniques, i.e., CF, NMF, GNMF and CNMF. GSNMFC has two main features. (1) It combines the virtues of the two classical methods mentioned above, CNMF and GNMF: it not only takes the label information as additional hard constraints and is parameter free in this respect, but also exploits the intrinsic geometric structure of the data distribution and incorporates it as an extra regularization term, which preserves the geometric structure of the original data and makes full use of the label information provided by the limited labeled data. (2) It adds a sparseness constraint on top of CNMF and GNMF, which allows the local feature information of the original data to be exploited more fully. The corresponding iterative formulas and a theoretical proof of convergence of the proposed method are provided. In addition, extensive experiments are performed on three commonly used databases (PIE_pose27, ORL_32 and Yale_32) to evaluate the proposed approach. From the experimental results, it can be concluded that the proposed algorithm achieves decent performance in terms of both clustering accuracy and normalized mutual information. In particular, the presented method is far better than CF and NMF for all values of $k$, and it has a distinct advantage over CNMF and GNMF in most cases. Because the experiments involve a large amount of random sampling, the proposed algorithm occasionally performs worse than CNMF and GNMF; on the whole, however, GSNMFC is effective and quite competitive against the four compared state-of-the-art algorithms. In the future, we will further improve the performance of GSNMFC.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61272214).

References

[1] D. D. Lee, H. S. Seung, Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems, 2000, 12: 556-562.
[2] D. D. Lee, H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature, 401 (6755) (1999) 788-791.
[3] Y. Pang, S. Wang, Y. Yuan, Learning regularized LDA by clustering, IEEE Trans. Neural Networks and Learning Systems, DOI: 10.1109/TNNLS.2014.2306844, 2014.
[4] X. Li, Y. Pang, Deterministic column-based matrix decomposition, IEEE Trans. Knowl. Data Eng., 22 (1) (2010) 145-149.
[5] P. O. Hoyer, Non-negative matrix factorization with sparseness constraints, Journal of Machine Learning Research, 5 (2004) 1457-1469.
[6] D. Dueck, Q. D. Morris, B. J. Frey, Probabilistic sparse matrix factorization, Technical Report PSI-2004-23, University of Toronto, 2004.
[7] S. Li, X. Hou, H. Zhang, et al., Learning spatially localized, parts-based representation, in: Proceedings of Computer Vision and Pattern Recognition, Los Alamitos, California, USA, 2001, I: 207-212.
[8] S. Wild, J. Curry, A. Dougherty, Improving non-negative matrix factorizations through structured initialization, Pattern Recognition, 37 (11) (2004) 2217-2232.
[9] P. O. Hoyer, Non-negative sparse coding, in: Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, Martigny, Switzerland, 2002, 557-565.
[10] P. O. Hoyer, Non-negative matrix factorization with sparseness constraints, Journal of Machine Learning Research, 5 (9) (2004) 1457-1469.
[11] D. Cai, X. He, J. Han, T. S. Huang, Graph regularized non-negative matrix factorization for data representation, IEEE Trans. Pattern Anal. Mach. Intell., 33 (8) (2011) 1548-1560.
[12] H. Liu, Z. Wu, X. Li, D. Cai, T. S. Huang, Constrained non-negative matrix factorization for image representation, IEEE Trans. Pattern Anal. Mach. Intell., 34 (7) (2012) 1299-1311.
[13] C. Ding, T. Li, M. I. Jordan, Convex and semi-nonnegative matrix factorizations, IEEE Trans. Pattern Anal. Mach. Intell., 32 (1) (2010) 45-55.
[14] Y. Yuan, X. Li, Y. Pang, et al., Binary sparse nonnegative matrix factorization, IEEE Trans. Circuits Syst. Video Technol., 19 (5) (2009) 772-777.
[15] F. Shahnaz, M. W. Berry, V. Pauca, R. J. Plemmons, Document clustering using nonnegative matrix factorization, Information Processing & Management, 42 (2) (2006) 373-386.
[16] W. Xu, X. Liu, Y. Gong, Document clustering based on non-negative matrix factorization, in: Proceedings of the 2003 Int. Conf. on Research and Development in Information Retrieval (SIGIR '03), Toronto, Canada, 2003, 267-273.
[17] S. Essid, C. Fevotte, Smooth nonnegative matrix factorization for unsupervised audiovisual document structuring, IEEE Trans. Multimedia, 15 (2) (2013) 415-425.
[18] S. Xie, Z. Yang, Y. Fu, Nonnegative matrix factorization applied to nonlinear speech and image cryptosystems, IEEE Trans. Circuits and Systems I, 55 (8) (2008) 2356-2367.
[19] L. Jing, C. Zhang, M. K. Ng, SNMFCA: supervised NMF-based image classification and annotation, IEEE Trans. Image Processing, 21 (11) (2012) 4508-4521.
[20] G. Zhou, Z. Yang, S. Xie, J.-M. Yang, Online blind source separation using incremental nonnegative matrix factorization with volume constraint, IEEE Trans. Neural Networks, 22 (4) (2011) 550-560.
[21] R. Zhi, M. Flierl, Q. Ruan, W. Bastiaan Kleijn, Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition, IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, 41 (1) (2011) 38-52.
[22] Y. Pang, Z. Ji, P. Jing, X. Li, Ranking graph embedding for learning to rerank, IEEE Trans. Neural Networks and Learning Systems, 24 (8) (2013) 1292-1303.
[23] Y.-X. Wang, Y.-J. Zhang, Nonnegative matrix factorization: a comprehensive review, IEEE Trans. Knowledge and Data Engineering, 25 (6) (2013) 1336-1353.
[24] X. Tian, D. Tao, Y. Rui, Sparse transfer learning for interactive video search reranking, ACM Trans. Multimedia Computing, Communications, and Applications (TOMCCAP), 8 (3) (2012).
[25] F. Sun, J. Tang, H. Li, G.-J. Qi, T. S. Huang, Multi-label image categorization with sparse factor representation, IEEE Trans. Image Processing, 23 (3) (2014) 1028-1037.
[26] W. Liu, N. Zheng, X. Lu, Non-negative matrix factorization for visual coding, in: Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Apr. 2003, 293-296.
[27] X. Li, Y. Pang, Y. Yuan, L1-norm-based 2DPCA, IEEE Trans. Systems, Man, and Cybernetics, Part B, 40 (4) (2010) 1170-1175.
[28] D. L. Donoho, M. Elad, Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization, Proceedings of the National Academy of Sciences, 100 (5) (2003) 2197-2202.
[29] J. M. P. Nascimento, J. M. B. Dias, Vertex component analysis: a fast algorithm to unmix hyperspectral data, IEEE Trans. Geosci. Remote Sens., 43 (4) (2005) 898-910.
[30] W. Xu, Y. Gong, Document clustering by concept factorization, in: Proceedings of the Ann. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2004.
[31] http://vasc.ri.cmu.edu/idb/html/face/index.html
[32] http://www.uk.research.att.com/facedatabase.html
[33] http://cvc.yale.edu/projects/yalefaces/yalefaces.html

Fuming Sun is currently a Professor with the School of Electronic and Information Engineering, Liaoning University of Technology, Jinzhou, China. He received the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2007. From 2012 to 2013, he was a Visiting Scholar with the Department of Automation, Tsinghua University. His current research interests include content-based image retrieval, image content analysis, and pattern recognition.

Meixiang Xu received her Bachelor degree from Binzhou University, Binzhou, China, in 2012 and is currently pursuing the Master degree at Liaoning University of Technology, Jinzhou, China. Her current research interests include pattern recognition and image retrieval.

Xuekao Hu received his Bachelor degree from Binzhou University, Binzhou, China, in 2012 and is currently pursuing the Master degree at Liaoning University of Technology, Jinzhou, China. His current research interests include pattern recognition and image retrieval.

Xiaojun Jiang is currently an Associate Professor with the College of Culture and Media, Liaoning University of Technology, Jinzhou, China. She received her Bachelor and Master degrees from Liaoning Normal University in 1998 and 2004, respectively. Her current research interests include social media marketing and information processing.
