Orthogonal nonnegative matrix tri-factorization for co-clustering: Multiplicative updates on Stiefel manifolds


Information Processing and Management 46 (2010) 559–570


Jiho Yoo, Seungjin Choi*
Department of Computer Science, Pohang University of Science and Technology, San 31 Hyoja-dong, Nam-gu, Pohang 790-784, Republic of Korea

Article info

Article history: Received 1 October 2008; Received in revised form 21 December 2009; Accepted 26 December 2009; Available online 27 January 2010.

Keywords: Co-clustering; Document clustering; Multiplicative updates; Nonnegative matrix factorization; Stiefel manifolds

Abstract

Matrix factorization-based methods have become popular in dyadic data analysis, where a fundamental problem, for example, is to perform document clustering or co-clustering words and documents given a term-document matrix. Nonnegative matrix tri-factorization (NMTF) has emerged as a promising tool for co-clustering, seeking a 3-factor decomposition $X \approx USV^\top$ with all factor matrices restricted to be nonnegative, i.e., $U \geq 0$, $S \geq 0$, $V \geq 0$. In this paper we develop multiplicative updates for orthogonal NMTF, where $X \approx USV^\top$ is pursued with the orthogonality constraints $U^\top U = I$ and $V^\top V = I$, exploiting true gradients on Stiefel manifolds. Experiments on various document data sets demonstrate that our method works well for document clustering and is useful in revealing polysemous words via co-clustering words and documents. © 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Automatic organization of documents has become crucial as the number of documents to be handled increases rapidly. Document clustering is an important task in organizing documents automatically, and it simplifies many subsequent tasks such as retrieval, summarization, and recommendation. In the bag-of-words model, a document is represented as an unordered collection of words, leading to a term-document matrix for a set of documents to be processed. A term-document matrix is nothing but a co-occurrence table, which is a simple case of dyadic data. Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set (Hofmann, Puzicha, & Jordan, 1999).

Matrix factorization-based methods have established themselves as powerful techniques in dyadic data analysis, where a fundamental problem, for example, is to perform document clustering or co-clustering words and documents given a term-document matrix. Nonnegative matrix factorization (NMF) (Lee & Seung, 1999, 2001) was successfully applied to document clustering (Shahnaz, Berry, Pauca, & Plemmons, 2006; Xu, Liu, & Gong, 2003), where a term-document matrix is decomposed into a product of two factor matrices, one of which corresponds to cluster centers (prototypes) and the other of which is associated with cluster indicator vectors. Orthogonal NMF, where an orthogonality constraint is imposed on a factor matrix in the decomposition, was shown to provide a clearer interpretation of the link between clustering and matrix decomposition (Ding, Li, Peng, & Park, 2006). The bases obtained from orthogonal NMF are usually more suitable for clustering problems because they tend to point toward the centers of the clusters, whereas the bases obtained from NMF tend to build a convex hull containing the data points (Fig. 1). Multiplicative updates for NMF that preserve orthogonality were recently proposed (Choi, 2008; Yoo & Choi, 2008).


Fig. 1. A synthetic example comparing NMF and orthogonal NMF on a clustering problem with three underlying clusters. The dots represent the data points and the dotted lines represent the directions of the basis vectors obtained by the NMF or orthogonal NMF algorithms. (a) NMF fails to find the three clusters correctly in this case, because the learnt basis vectors build a convex hull containing all the data points. (b) Orthogonal NMF correctly finds the underlying clusters, because the algorithm matches exactly one basis vector to each data point.

Document clustering using NMF involves one-sided clustering, seeking a clustering of columns based on word distributions in a term-document matrix. Co-clustering is the simultaneous clustering of rows and columns in a term-document matrix, interlacing row clusterings with column clusterings at each iteration (Dhillon, Mallela, & Modha, 2003; Long, Zhang, & Yu, 2005). Nonnegative block value decomposition (NBVD) is a 3-factor decomposition, seeking an approximation of a term-document matrix by a product of a row-coefficient matrix, a block value matrix, and a column-coefficient matrix, with all three matrices restricted to be nonnegative (Long et al., 2005). The row-coefficient matrix and column-coefficient matrix in NBVD are associated with row-clusters and column-clusters, respectively. Orthogonal nonnegative matrix tri-factorization (ONM3F) is also a 3-factor decomposition of a nonnegative data matrix, with orthogonality constraints imposed on the factor matrices (Ding et al., 2006). NBVD and ONM3F were shown to work well for co-clustering of words and documents. Moreover, co-clustering implicitly performs an adaptive dimensionality reduction at each iteration, leading to better document clustering accuracy compared to one-sided clustering methods (Dhillon et al., 2003).

In this paper we elaborate on orthogonal nonnegative matrix tri-factorization, developing a new multiplicative updating algorithm that preserves the orthogonality constraints. The constraint surface, which is the set of orthonormal matrices, is known as the Stiefel manifold (Stiefel, 1935-1936). In contrast to the existing ONMTF method (Ding et al., 2006), where Lagrangian optimization is used, we derive our multiplicative updates by directly exploiting true gradient information on Stiefel manifolds (Smith, 1993; Edelman, Arias, & Smith, 1998), such that our updating rules preserve the orthogonality constraints on the factor matrices in the tri-factorization of a data matrix. Throughout this paper, our method is referred to as 'ONMTF', distinguishing it from the ONM3F proposed in Ding et al. (2006).

The rest of this paper is organized as follows. Section 2 explains how to use NMF and nonnegative matrix tri-factorization (NMTF) for document clustering or co-clustering words and documents, deferring detailed algorithms until Section 3. Thus, in Section 2 we first explain NMF for document clustering, based on its probabilistic interpretation; NMTF is then elucidated from a co-clustering perspective, based on its probabilistic interpretation as well. In Section 3, we re-derive the multiplicative updates (Lee & Seung, 2001) for NMF using a different approach. Then we develop multiplicative updating algorithms for orthogonal NMF and NMTF, exploiting true gradient information on Stiefel manifolds. Numerical experiments on several document data sets are given in Section 4, where we compare our ONMTF to several matrix factorization methods in terms of clustering accuracy and normalized mutual information. We also demonstrate that our ONMTF is useful in revealing polysemous words via co-clustering words and documents. Finally, conclusions are drawn in Section 5.

2. Nonnegative matrix tri-factorization for clustering

In the vector-space model of text data, a document is represented by an $M$-dimensional vector $x \in \mathbb{R}_+^M$, where $M$ is the number of terms (words) in the dictionary. Given $N$ documents, we construct a term-document matrix $X \in \mathbb{R}_+^{M \times N}$, where $X_{ij}$ corresponds to the significance of term $t_i$ in document $d_j$, calculated by

$$X_{ij} = \mathrm{TF}_{ij} \log\left(\frac{N}{\mathrm{DF}_i}\right),$$

where $\mathrm{TF}_{ij}$ denotes the frequency of term $t_i$ in document $d_j$ and $\mathrm{DF}_i$ represents the number of documents containing term $t_i$. Elements $X_{ij}$ are always nonnegative, and equal zero when the corresponding term either does not appear in the document or appears in all the documents.
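As a concrete illustration, this weighting can be computed from a raw term-count matrix in a few lines of NumPy. The following is a minimal sketch under our own naming (`tfidf`, `counts`); it is not code from the paper:

```python
import numpy as np

def tfidf(counts):
    """Term weighting X_ij = TF_ij * log(N / DF_i) from raw counts.

    counts: (M, N) array with counts[i, j] = frequency of term t_i in document d_j.
    """
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)      # DF_i: number of documents containing t_i
    df = np.maximum(df, 1)                     # guard against terms that never occur
    idf = np.log(n_docs / df)                  # log(N / DF_i)
    return counts * idf[:, None]               # scale row i (term t_i) by its idf
```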

In this section, we first briefly explain how we make use of NMF for document clustering, based on its probabilistic interpretation. Then we give an elucidation of NMTF and its application to co-clustering, emphasizing the corresponding probabilistic model.

2.1. NMF for clustering

NMF seeks a 2-factor decomposition of $X \in \mathbb{R}_+^{M \times N}$ that is of the form

$$X \approx UV^\top, \qquad (1)$$

where the factor matrices $U \in \mathbb{R}_+^{M \times K}$ and $V \in \mathbb{R}_+^{N \times K}$ are restricted to be nonnegative. The dimension $K$ corresponds to the number of clusters when NMF is used for clustering. The factor matrices $U$ and $V$ have the following dual interpretation:

- When columns in $X$ are treated as data points in $M$-dimensional space, columns in $U$ are considered as basis vectors and each row in $V$ is an encoding that represents the extent to which each basis vector is used to reconstruct the corresponding data vector.
- Alternatively, when rows in $X$ are data points in $N$-dimensional space, columns in $V$ correspond to basis vectors and each row in $U$ represents an encoding.

Applying NMF to a term-document matrix for document clustering, each column of $X$ is treated as a data point in $M$-dimensional space. In such a case, the factorization (1) is interpreted as follows (see Fig. 2):

- $U_{ij}$ corresponds to the degree to which term $t_i$ belongs to cluster $c_j$. In other words, column $j$ of $U$ (denoted by $u_j$) is associated with a prototype vector (center) for cluster $c_j$.
- $V_{ij}$ corresponds to the degree to which document $d_i$ is associated with cluster $c_j$. With appropriate normalization, $V_{ij}$ is proportional to the posterior probability of cluster $c_j$ given document $d_i$. More details on the probabilistic interpretation of NMF for document clustering are summarized in Section 2.2.

2.2. Probabilistic interpretation of NMF

A probabilistic interpretation of NMF, as in probabilistic latent semantic indexing (PLSI) (Hofmann, 1999b), was given in Gaussier and Goutte (2005), where it was shown that the EM algorithm for learning the aspect model in PLSI is equivalent to the multiplicative updates for NMF derived from the minimization of the I-divergence objective function. This equivalence was further elaborated in Ding, Li, and Peng (2008), where PLSI and NMF were shown to optimize the same objective function although they are different algorithms. We give a brief overview of this probabilistic interpretation of NMF, which will be a foundation for our probabilistic interpretation of NMTF. Let us consider the joint probability of term and document, $p(t_i, d_j)$, which is factorized by

$$p(t_i, d_j) = \sum_k p(t_i, d_j \mid c_k)\, p(c_k) = \sum_k p(t_i \mid c_k)\, p(d_j \mid c_k)\, p(c_k), \qquad (2)$$

where $p(c_k)$ is the prior probability for cluster $c_k$. The model with the factorization (2) is known as the aspect model, which was also used in PLSI (Hofmann, 1999b). Elements of the term-document matrix, $X_{ij}$, can be treated as $p(t_i, d_j)$, provided that $X$ is divided by $\mathbf{1}^\top X \mathbf{1}$ so that $\sum_i \sum_j X_{ij} = 1$, where $\mathbf{1} = [1, \ldots, 1]^\top$ with appropriate dimension. Relating (2) to the factorization (1), $U_{ik}$ corresponds to $p(t_i \mid c_k)$, representing the significance of term $t_i$ in cluster $c_k$. Applying sum-to-one normalization to each column of $U$, i.e., $U D_U^{-1}$ where $D_U \equiv \mathrm{diag}(\mathbf{1}^\top U)$, we have the exact relation

$$\left[U D_U^{-1}\right]_{ik} = p(t_i \mid c_k).$$


Fig. 2. A pictorial example of clustering with a term-document matrix $X \in \mathbb{R}_+^{6 \times 8}$ is shown in the case of 2 document-clusters. A larger square indicates a larger value of the corresponding element in the matrix. NMF determines factor matrices $U$ and $V$, where columns in $U$ are associated with cluster centers (prototypes) and column vectors in $V^\top$ are associated with cluster indicators, from which one can see two well-separated document-clusters.


Assume that $X$ is normalized such that $\sum_i \sum_j X_{ij} = 1$. We define a scaling matrix $D_V \equiv \mathrm{diag}(\mathbf{1}^\top V)$. Then the factorization (1) can be rewritten as

$$X = (U D_U^{-1})(D_U D_V)(V D_V^{-1})^\top. \qquad (3)$$

Comparing (3) with the factorization (2), one can see that each element of the diagonal matrix $D \equiv D_U D_V$ corresponds to the cluster prior $p(c_k)$. In the case of unnormalized $X$, the prior matrix $D$ absorbs the scaling factor, so in practice the data matrix does not have to be normalized in advance. In a task of clustering, we need to calculate the cluster posterior $p(c_k \mid d_j)$. Applying Bayes' rule, the cluster posterior is given by the document likelihood and the cluster prior probability. That is, $p(c_k \mid d_j)$ is given by

$$p(c_k \mid d_j) \propto p(d_j \mid c_k)\, p(c_k) = \left[D (V D_V^{-1})^\top\right]_{kj} = \left[(D_U D_V)(D_V^{-1} V^\top)\right]_{kj} = \left[D_U V^\top\right]_{kj}. \qquad (4)$$

It follows from (4) that $(V D_U)^\top$ yields the posterior probability of the cluster, requiring the normalization of $V$ using the diagonal matrix $D_U$. Thus, we assign document $d_j$ to cluster $k^*$ if

$$k^* = \arg\max_k\, [V D_U]_{jk}.$$

Document clustering by NMF was first developed in Xu et al. (2003). Here we use only a different normalization and summarize the algorithm below.

Algorithm outline: Document clustering by NMF
(1) Construct a term-document matrix $X$.
(2) Apply NMF to $X$, yielding $X = UV^\top$.
(3) Normalize $U$ and $V$: $U \leftarrow U D_U^{-1}$, $V \leftarrow V D_U$, where $D_U = \mathrm{diag}(\mathbf{1}^\top U)$.
(4) Assign document $d_j$ to cluster $k^*$ if $k^* = \arg\max_k V_{jk}$.
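This outline translates directly into NumPy. Below is a minimal sketch, assuming random nonnegative initialization and a fixed iteration count (both our choices, not the paper's); the factorization step uses the Lee-Seung updates (15)-(16) derived in Section 3, with a small epsilon added to the denominators to avoid division by zero:

```python
import numpy as np

def nmf_document_clustering(X, K, n_iter=200, eps=1e-9, seed=0):
    """Steps (1)-(4): factorize X ~= U V^T, normalize, assign clusters."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K))
    V = rng.random((N, K))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)    # update (15)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)  # update (16)
    d_u = U.sum(axis=0)                         # diagonal of D_U = diag(1^T U)
    V = V * d_u                                 # V <- V D_U (U <- U D_U^{-1} analogously)
    return V.argmax(axis=1)                     # assign d_j to argmax_k V_jk
```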

2.3. NMTF for co-clustering

NMTF seeks a 3-factor decomposition of $X \in \mathbb{R}_+^{M \times N}$ that is of the form

$$X \approx U S V^\top, \qquad (5)$$

where $U \in \mathbb{R}_+^{M \times L}$, $S \in \mathbb{R}_+^{L \times K}$, and $V \in \mathbb{R}_+^{N \times K}$ are restricted to be nonnegative matrices. The number of rows and columns in $S$ is associated with the number of term-clusters ($L$) and the number of document-clusters ($K$), respectively. The factorization (5) was studied under the name nonnegative block value decomposition (Long et al., 2005), where $U$ is the row-coefficient matrix, $V$ is the column-coefficient matrix, and $S$ is the block value matrix. In the case where $X$ is a term-document matrix, the factor matrices determined by the decomposition (5) are interpreted as follows (see Fig. 3).


Fig. 3. A pictorial example of clustering with a term-document matrix $X \in \mathbb{R}_+^{6 \times 8}$ is shown in the case of 3 term-clusters and 2 document-clusters. A larger square indicates a larger value of the corresponding element in the matrix. NMTF determines factor matrices $U$, $S$, and $V$, where $U$ and $V$ indicate 3 term-clusters and 2 document-clusters, respectively. The matrix $S$ represents how document-clusters are related to term-clusters: each column in $S$ tells which term-clusters contribute to each document-cluster. In this case, term-clusters 1 and 3 contribute to document-cluster 1, and term-clusters 2 and 3 contribute to document-cluster 2. In fact, $US$ in NMTF plays the role of $U$ in NMF in Fig. 2.

- Combining the first two factor matrices or the last two factor matrices relates the 3-factor decomposition (5) to the 2-factor decomposition (1). Column vectors in $US$ are associated with basis vectors for the column space of $X$, while row vectors in $SV^\top$ correspond to basis vectors for the row space of $X$.
- The factor matrices $U$ and $V$ denote the degrees to which the rows and columns of $X$ are associated with their clusters. There are two types of clusters: term-clusters $\{b_l\}_{l=1}^{L}$ and document-clusters $\{c_k\}_{k=1}^{K}$. $U_{il}$ corresponds to the degree to which term $t_i$ belongs to term-cluster $b_l$, and $V_{jk}$ corresponds to the degree to which document $d_j$ is associated with document-cluster $c_k$.

2.4. Probabilistic interpretation of NMTF

As in NMF, the probabilistic interpretation of NMTF is given in a similar manner but with a different probabilistic model; we further elaborate on the work in Li and Ding (2006). We consider both term-clusters $b_l$ and document-clusters $c_k$ to properly factorize the joint probability $p(t_i, d_j)$ as follows:

$$p(t_i, d_j) = \sum_l \sum_k p(t_i, d_j \mid b_l, c_k)\, p(b_l, c_k) = \sum_l \sum_k p(t_i \mid b_l)\, p(d_j \mid c_k)\, p(b_l, c_k), \qquad (6)$$

where $p(b_l, c_k)$ is the joint prior probability for term-cluster $b_l$ and document-cluster $c_k$. A probabilistic model with the factorization (6) was shown to be associated with an optimal co-clustering in the sense that the loss in mutual information is minimized (Dhillon et al., 2003). Introducing the sum-to-one normalization matrices $D_U = \mathrm{diag}(\mathbf{1}^\top U)$ and $D_V = \mathrm{diag}(\mathbf{1}^\top V)$, we re-write the 3-factor decomposition (5) as

$$X = (U D_U^{-1})(D_U S D_V)(V D_V^{-1})^\top.$$

Relating (6) to the 3-factor decomposition (5), the marginal distributions $p(t_i \mid b_l)$ and $p(d_j \mid c_k)$ are associated with $U D_U^{-1}$ and $V D_V^{-1}$, respectively. The joint probability $p(b_l, c_k)$ corresponds to the $(l, k)$-element of $D_U S D_V$, representing the joint prior for term-cluster and document-cluster. For document clustering, we calculate the posterior probability of the document-cluster given a document,

$$p(c_k \mid d_j) \propto p(d_j \mid c_k)\, p(c_k) = p(d_j \mid c_k) \sum_l p(b_l, c_k) = \left[\mathrm{diag}(\mathbf{1}^\top D_U S D_V)(D_V^{-1} V^\top)\right]_{kj} = \left[\mathrm{diag}(\mathbf{1}^\top D_U S)\, V^\top\right]_{kj}.$$

Normalizing the matrix $V$ using $\mathrm{diag}(\mathbf{1}^\top D_U S)$, we assign document $d_j$ to document-cluster $k^*$ if

$$k^* = \arg\max_k\, \left[V\, \mathrm{diag}(\mathbf{1}^\top D_U S)\right]_{jk}. \qquad (7)$$

The posterior probability of the term-cluster given a term is computed as

$$p(b_l \mid t_i) \propto p(t_i \mid b_l)\, p(b_l) = p(t_i \mid b_l) \sum_k p(b_l, c_k) = \left[U D_U^{-1}\, \mathrm{diag}(D_U S D_V \mathbf{1})\right]_{il} = \left[U\, \mathrm{diag}(S D_V \mathbf{1})\right]_{il}.$$

Thus, for clustering terms, we normalize the factor matrix $U$ by post-multiplying it by $\mathrm{diag}(S D_V \mathbf{1})$. We assign term $t_i$ to term-cluster $l^*$ if

$$l^* = \arg\max_l\, \left[U\, \mathrm{diag}(S D_V \mathbf{1})\right]_{il}.$$

Algorithm outline: Document clustering by NMTF
(1) Construct a term-document matrix $X$.
(2) Apply NMTF to $X$, yielding $X = U S V^\top$.
(3) Normalize $U$ and $V$: $U \leftarrow U\, \mathrm{diag}(S D_V \mathbf{1})$, $V \leftarrow V\, \mathrm{diag}(\mathbf{1}^\top D_U S)$, where $D_U = \mathrm{diag}(\mathbf{1}^\top U)$ and $D_V = \mathrm{diag}(\mathbf{1}^\top V)$. (8)
(4) Assign document $d_j$ to cluster $k^*$ if $k^* = \arg\max_k V_{jk}$.
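In the tri-factor case, steps (3) and (4) amount to two diagonal rescalings followed by row-wise argmax operations. A minimal sketch under our own naming (the function and variable names are assumptions, not the paper's):

```python
import numpy as np

def assign_clusters_nmtf(U, S, V):
    """Normalize NMTF factors and assign term- and document-clusters.

    U: (M, L), S: (L, K), V: (N, K), all nonnegative, with X ~= U S V^T.
    """
    d_u = U.sum(axis=0)            # diagonal of D_U = diag(1^T U), length L
    d_v = V.sum(axis=0)            # diagonal of D_V = diag(1^T V), length K
    U_n = U * (S @ d_v)            # U <- U diag(S D_V 1): term-cluster posteriors
    V_n = V * (d_u @ S)            # V <- V diag(1^T D_U S): document-cluster posteriors
    return U_n.argmax(axis=1), V_n.argmax(axis=1)
```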


3. Multiplicative algorithms

We begin with the multiplicative algorithm for NMF, which we re-derive using an approach different from Lee and Seung (2001). Then we introduce our recent work (Yoo & Choi, 2008), where multiplicative updates for orthogonal NMF are derived directly from true gradient information on Stiefel manifolds. Finally, we adopt the same technique to develop orthogonality-preserving multiplicative updates for NMTF.

3.1. NMF algorithm

We consider the squared Euclidean distance as a discrepancy measure between the data $X$ and the model $UV^\top$, leading to the least squares error function

$$E = \frac{1}{2}\|X - UV^\top\|^2. \qquad (9)$$

NMF involves the following optimization:

$$\arg\min_{U \geq 0,\, V \geq 0} E = \frac{1}{2}\|X - UV^\top\|^2. \qquad (10)$$

Gradient descent learning (an additive update) can be applied to determine a solution to (10); however, the nonnegativity of $U$ and $V$ is not preserved without further operations at each iteration. On the other hand, a multiplicative method developed in Lee and Seung (2001) provides a simple algorithm for (10). We give a slightly different approach from Lee and Seung (2001) to derive the same multiplicative algorithm. Suppose that the gradient of an error function has a decomposition of the form

$$\nabla E = [\nabla E]^{+} - [\nabla E]^{-}, \qquad (11)$$

where $[\nabla E]^{+} > 0$ and $[\nabla E]^{-} > 0$. Then the multiplicative update for parameters $\Theta$ has the form

$$\Theta \leftarrow \Theta \odot \left(\frac{[\nabla E]^{-}}{[\nabla E]^{+}}\right)^{.\eta}, \qquad (12)$$

where $\odot$ represents the Hadamard product (elementwise product), $(\cdot)^{.\eta}$ denotes the elementwise power, and $\eta$ is a learning rate ($0 < \eta \leq 1$). It can easily be seen that the multiplicative update (12) preserves the nonnegativity of the parameter $\Theta$, while $\nabla E = 0$ when convergence is achieved. Derivatives of the error function (9) with respect to $U$ with $V$ fixed, and with respect to $V$ with $U$ fixed, are given by

$$\nabla_U E = [\nabla_U E]^{+} - [\nabla_U E]^{-} = UV^\top V - XV, \qquad (13)$$

$$\nabla_V E = [\nabla_V E]^{+} - [\nabla_V E]^{-} = VU^\top U - X^\top U. \qquad (14)$$

With these gradient calculations, the rule (12) with $\eta = 1$ yields the well-known Lee and Seung multiplicative updates (Lee & Seung, 2001)

$$U \leftarrow U \odot \frac{XV}{UV^\top V}, \qquad (15)$$

$$V \leftarrow V \odot \frac{X^\top U}{VU^\top U}, \qquad (16)$$

where the division is also an elementwise operation.
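The split-gradient recipe (11)-(12) is generic: any factor admitting such a decomposition can be updated by one elementwise step. A hedged one-function sketch (the epsilon guard is our addition):

```python
import numpy as np

def multiplicative_step(theta, grad_pos, grad_neg, eta=1.0, eps=1e-9):
    """One application of rule (12): theta <- theta * (grad_neg / grad_pos)^eta.

    grad_pos, grad_neg: positive/negative parts of the gradient split (11).
    Nonnegativity is preserved since every factor on the right is nonnegative.
    """
    return theta * (grad_neg / (grad_pos + eps)) ** eta

# Example: for the NMF error (9), grad_neg = X @ V and grad_pos = U @ V.T @ V
# reproduce update (15) when eta = 1.
```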

3.2. ONMF algorithm

Orthogonal NMF involves a decomposition (1) as in NMF, but requires that $U$ or $V$ satisfies an orthogonality constraint such that $U^\top U = I$ or $V^\top V = I$ (Choi, 2008). In this paper we consider the case where $V^\top V = I$ is incorporated into the optimization (10). In such a case, it was shown that orthogonal NMF is equivalent to $k$-means clustering in the sense that they share the same objective function (Ding, He, & Simon, 2005). In this section, we present a new algorithm for orthogonal NMF with $V^\top V = I$, where we incorporate the gradient on the Stiefel manifold into the multiplicative update. Orthogonal NMF with $V^\top V = I$ is formulated as the following optimization problem:

$$\arg\min_{U, V} E = \frac{1}{2}\|X - UV^\top\|^2 \quad \text{subject to} \quad V^\top V = I,\; U \geq 0,\; V \geq 0. \qquad (17)$$


In general, the constrained optimization problem (17) is solved by introducing a Lagrangian with a penalty term $\mathrm{tr}\{\Lambda(V^\top V - I)\}$, where $\Lambda$ is a symmetric matrix containing Lagrangian multipliers. Ding et al. (2006) took this approach with some approximation, developing multiplicative updates. Here we present a different approach, incorporating into (12) the gradient on the constraint surface on which $V^\top V = I$ is satisfied. With $U$ fixed in (9), we treat (9) as a function of $V$. Minimizing (9) where $V$ is constrained to the set of $N \times K$ matrices such that $V^\top V = I$ was well studied in Smith (1993) and Edelman et al. (1998). Here we incorporate nonnegativity constraints on $V$ to develop multiplicative updates that preserve the orthogonality constraint $V^\top V = I$. The constraint surface, the set of $N \times K$ orthonormal matrices such that $V^\top V = I$, is known as the Stiefel manifold (Stiefel, 1935-1936). An equation defining tangents to the Stiefel manifold at a point $V$ is obtained by differentiating $V^\top V = I$, yielding

$$V^\top \Delta + \Delta^\top V = 0, \qquad (18)$$

i.e., $V^\top \Delta$ is skew-symmetric. The canonical metric on the Stiefel manifold (Edelman et al., 1998) is given by

$$g_c(\Delta, \Delta) = \mathrm{tr}\left\{\Delta^\top \left(I - \tfrac{1}{2} V V^\top\right) \Delta\right\}, \qquad (19)$$

whereas the Euclidean metric is given by

$$g_e(\Delta, \Delta) = \mathrm{tr}\{\Delta^\top \Delta\}. \qquad (20)$$

We define the partial derivatives of $E$ with respect to the elements of $V$ as

$$[\nabla_V E]_{ij} = \frac{\partial E}{\partial V_{ij}}. \qquad (21)$$

For the function $E$ in (9) (with $U$ fixed) defined on the Stiefel manifold, the gradient of $E$ at $V$ is defined to be the tangent vector $\widetilde{\nabla}_V E$ such that

$$g_e(\nabla_V E, \Delta) = \mathrm{tr}\{(\nabla_V E)^\top \Delta\} = g_c(\widetilde{\nabla}_V E, \Delta) = \mathrm{tr}\left\{(\widetilde{\nabla}_V E)^\top \left(I - \tfrac{1}{2} V V^\top\right) \Delta\right\}, \qquad (22)$$

for all tangent vectors $\Delta$ at $V$. Solving (22) for $\widetilde{\nabla}_V E$ such that $V^\top \widetilde{\nabla}_V E$ is skew-symmetric yields

$$\widetilde{\nabla}_V E = \nabla_V E - V [\nabla_V E]^\top V. \qquad (23)$$

Thus, with the partial derivatives in (14), the gradient on the Stiefel manifold is calculated as

$$\widetilde{\nabla}_V E = (-X^\top U + V U^\top U) - V(-X^\top U + V U^\top U)^\top V = V U^\top X V - X^\top U = [\widetilde{\nabla}_V E]^{+} - [\widetilde{\nabla}_V E]^{-}. \qquad (24)$$

Invoking the relation (12) with $\nabla_V$ replaced by $\widetilde{\nabla}_V$ yields

$$V \leftarrow V \odot \frac{X^\top U}{V U^\top X V}, \qquad (25)$$

which is our ONMF algorithm. The updating rule for $U$ is the same as (15).
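Putting (15) and (25) together gives the full ONMF iteration. The sketch below is ours: random initialization, a fixed iteration budget, and the epsilon guard are assumptions, not prescriptions from the paper:

```python
import numpy as np

def onmf(X, K, n_iter=500, eps=1e-9, seed=0):
    """ONMF: standard update (15) for U, Stiefel-gradient update (25) for V."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K))
    V = rng.random((N, K))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)          # update (15)
        V *= (X.T @ U) / (V @ (U.T @ (X @ V)) + eps)  # update (25): denominator V U^T X V
    return U, V
```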

3.3. ONMTF algorithm

Orthogonal NMTF (ONMTF) is formulated as the following optimization problem:

$$\arg\min_{U, S, V} J = \frac{1}{2}\|X - U S V^\top\|^2$$

subject to

$$U^\top U = I, \quad V^\top V = I \quad \text{(orthogonality constraints)}, \qquad (26)$$
$$U \geq 0, \quad S \geq 0, \quad V \geq 0 \quad \text{(nonnegativity constraints)}.$$

Invoking (23), the true gradients on the Stiefel manifolds $\{U \mid U^\top U = I\}$ and $\{V \mid V^\top V = I\}$ are computed as

$$\widetilde{\nabla}_U J = \nabla_U J - U [\nabla_U J]^\top U = U S V^\top X^\top U - X V S^\top = [\widetilde{\nabla}_U J]^{+} - [\widetilde{\nabla}_U J]^{-}, \qquad (27)$$

$$\widetilde{\nabla}_V J = \nabla_V J - V [\nabla_V J]^\top V = V S^\top U^\top X V - X^\top U S = [\widetilde{\nabla}_V J]^{+} - [\widetilde{\nabla}_V J]^{-}. \qquad (28)$$

The standard gradient with respect to $S$ is given by

$$\nabla_S J = U^\top U S V^\top V - U^\top X V = [\nabla_S J]^{+} - [\nabla_S J]^{-}. \qquad (29)$$


Table 1. Multiplicative updating algorithms are summarized: (1) standard NMF (Lee & Seung, 2001); (2) nonnegative block value decomposition (NBVD) (Long et al., 2005); (3) Ding et al.'s orthogonal matrix tri-factorization (ONM3F) (Ding et al., 2006); and (4) our proposed orthogonal matrix tri-factorization (ONMTF).

- NMF, model $X \approx UV^\top$:
  $U \leftarrow U \odot \frac{XV}{UV^\top V}$, $\quad V \leftarrow V \odot \frac{X^\top U}{VU^\top U}$
- NBVD, model $X \approx USV^\top$:
  $U \leftarrow U \odot \frac{XVS^\top}{USV^\top VS^\top}$, $\quad S \leftarrow S \odot \frac{U^\top XV}{U^\top USV^\top V}$, $\quad V \leftarrow V \odot \frac{X^\top US}{VS^\top U^\top US}$
- ONM3F, model $X \approx USV^\top$:
  $U \leftarrow U \odot \left(\frac{XVS^\top}{UU^\top XVS^\top}\right)^{.\frac{1}{2}}$, $\quad S \leftarrow S \odot \left(\frac{U^\top XV}{U^\top USV^\top V}\right)^{.\frac{1}{2}}$, $\quad V \leftarrow V \odot \left(\frac{X^\top US}{VV^\top X^\top US}\right)^{.\frac{1}{2}}$
- ONMTF, model $X \approx USV^\top$:
  $U \leftarrow U \odot \frac{XVS^\top}{USV^\top X^\top U}$, $\quad S \leftarrow S \odot \frac{U^\top XV}{U^\top USV^\top V}$, $\quad V \leftarrow V \odot \frac{X^\top US}{VS^\top U^\top XV}$

Table 2. Document dataset details are summarized: for each dataset, the number of classes, the number of documents, the number of terms, and the maximum/minimum cluster size are described.

Dataset   # Classes   # Documents   # Terms   Max cluster size   Min cluster size
CSTR      4           602           3791      214                96
k1a       20          2340          13879     494                9
k1b       6           2340          13879     1389               60
re0       13          1504          2886      608                11
re1       25          1657          3758      371                10
wap       20          1560          8460      341                5

As in ONMF, we use the relation (12) with the gradient calculations (27)-(29) to develop the following multiplicative algorithm:

$$U \leftarrow U \odot \frac{X V S^\top}{U S V^\top X^\top U}, \qquad (30)$$

$$V \leftarrow V \odot \frac{X^\top U S}{V S^\top U^\top X V}, \qquad (31)$$

$$S \leftarrow S \odot \frac{U^\top X V}{U^\top U S V^\top V}. \qquad (32)$$
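Updates (30)-(32) are straightforward to implement. The sketch below is ours (initialization, iteration count, and the epsilon guard are assumptions, not prescriptions from the paper), with matrix products parenthesized so that no large intermediate matrices are formed:

```python
import numpy as np

def onmtf(X, L, K, n_iter=800, eps=1e-9, seed=0):
    """Proposed ONMTF updates (30)-(32) for X ~= U S V^T.

    X: (M, N) nonnegative; L term-clusters; K document-clusters.
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, L))
    S = rng.random((L, K))
    V = rng.random((N, K))
    for _ in range(n_iter):
        U *= (X @ V @ S.T) / ((U @ S) @ (V.T @ (X.T @ U)) + eps)    # update (30)
        V *= (X.T @ (U @ S)) / (V @ (S.T @ (U.T @ (X @ V))) + eps)  # update (31)
        S *= (U.T @ (X @ V)) / ((U.T @ U) @ S @ (V.T @ V) + eps)    # update (32)
    return U, S, V
```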

4. Experimental results

We evaluated the performance of four different matrix factorization methods, NMF, NBVD, ONM3F, and ONMTF, in a task of document clustering. The models and associated multiplicative updates of these algorithms are summarized in Table 1. We begin with a description of the datasets used in our experiments, and then present the clustering performance of the algorithms. We also provide experimental results demonstrating that our method is useful in revealing polysemous words via co-clustering terms and documents.

4.1. Datasets

We used six datasets for document clustering: CSTR, k1a, k1b, re0, re1, and wap. The statistics of the datasets are summarized in Table 2, and detailed descriptions are listed below.

- CSTR is a dataset consisting of 600 technical reports collected from the list of computer science technical reports at the homepage of the Department of Computer Science, University of Rochester (http://www.cs.rochester.edu/trs/). The documents are categorized by topic, i.e., AI, robotics, systems, and theory.
- k1a, k1b, and wap are datasets built for the WebACE project (Han et al., 1998). The datasets are included in the CLUTO clustering toolkit (http://www.glaros.dtc.umn.edu/gkhome/cluto/cluto/overview), and consist of web pages in various subject directories of Yahoo! (http://www.yahoo.com/). k1a and k1b actually contain the same documents, but the web pages in k1a are categorized into more detailed clusters, involving a deeper directory hierarchy.
- re0 and re1 are subsets of the Reuters-21578 dataset (http://www.daviddlewis.com/resources/testcollections/reuters21578/), which consists of news articles from the Reuters newswire in 1987. These datasets are also included in the CLUTO toolkit.


We removed stop words using a standard stop list. We used the terms occurring in more than two documents, because terms occurring in only one document carry no information for grouping multiple documents. The normalized-cut weighting (Xu et al., 2003), defined by

$$X \leftarrow X\, \mathrm{diag}(X^\top X \mathbf{1})^{-\frac{1}{2}}$$

for the input matrix $X$, is applied to the term-document matrices before applying the document clustering algorithms.
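A minimal sketch of this weighting (our code; the small constant guards against all-zero columns):

```python
import numpy as np

def normalized_cut_weighting(X):
    """X <- X diag(X^T X 1)^(-1/2), computed without forming X^T X explicitly."""
    d = X.T @ (X @ np.ones(X.shape[1]))   # d_j = (X^T X 1)_j
    return X / np.sqrt(d + 1e-12)         # scale column j by d_j^(-1/2)
```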

PN CA ¼

j¼1 dðc j ;

e cjÞ

N

;

where $\delta(x, y) = 1$ for $x = y$ and $\delta(x, y) = 0$ otherwise. CA attains its maximum value 1 when the clustering result is perfect.

- NMI: NMI is based on the mutual information (MI) between two sets of clusters, $C$ and $\widetilde{C}$, which correspond to the set of estimated clusters and the set of ground-truth clusters, respectively. Denote by $C_i$ and $\widetilde{C}_j$ the set of documents grouped into cluster $i$ (by one of the algorithms under consideration) and the set of documents in ground-truth cluster $j$, respectively. With these definitions, the MI between $C$ and $\widetilde{C}$ is calculated as

$$\mathrm{MI}(C, \widetilde{C}) = \sum_i \sum_j p(C_i, \widetilde{C}_j) \log_2 \frac{p(C_i, \widetilde{C}_j)}{p(C_i)\, p(\widetilde{C}_j)} = \sum_i \sum_j \frac{|C_i \cap \widetilde{C}_j|}{N} \log_2 \frac{N |C_i \cap \widetilde{C}_j|}{|C_i| |\widetilde{C}_j|}.$$

So that the value of NMI ranges between 0 and 1, MI is normalized by $\max(H(C), H(\widetilde{C}))$, where $H(\cdot)$ denotes the entropy:

$$\mathrm{NMI}(C, \widetilde{C}) = \frac{\mathrm{MI}(C, \widetilde{C})}{\max(H(C), H(\widetilde{C}))}.$$

Table 3. Clustering accuracies (CA), averaged over 100 trials. In each row, the highest CA is the best result; results significantly worse than the best are marked with *. Statistical significance is measured using the Wilcoxon rank-sum test with p-value 0.01.

Dataset   NMF       NBVD      ONM3F     ONMTF
CSTR      0.6029*   0.5775*   0.7529    0.5434*
k1a       0.4696    0.4653    0.3784*   0.4748
k1b       0.8319    0.7902*   0.5594*   0.8193
re0       0.3552    0.3526    0.3435*   0.3611
re1       0.4892    0.4712*   0.4058*   0.4981
wap       0.4689    0.4601    0.3706*   0.4764

Table 4. Normalized mutual information (NMI), averaged over 100 trials. In each row, the highest NMI is the best result; results significantly worse than the best are marked with *. Statistical significance is measured using the Wilcoxon rank-sum test with p-value 0.01.

Dataset   NMF       NBVD      ONM3F     ONMTF
CSTR      0.3304*   0.3029*   0.5799    0.2547*
k1a       0.5476    0.5511    0.4533*   0.5166*
k1b       0.6972    0.6406*   0.4418*   0.6462*
re0       0.2966    0.3002    0.2822*   0.2990
re1       0.5169    0.5063*   0.4703*   0.5078
wap       0.5440    0.5416    0.4552*   0.5110*


To evaluate CA and NMI, we ran each algorithm 100 times, starting from different initial conditions; final CA and NMI values were computed by averaging over the 100 independent trials. Table 3 summarizes the CA results. ONMTF usually yielded better CA than the other algorithms, except on the CSTR and k1b datasets. ONM3F showed significantly better results than the other algorithms on the CSTR dataset, but significantly worse performance than ONMTF on the other datasets. Table 4 summarizes the NMI results. NMF and NBVD usually showed better NMI values than ONM3F and ONMTF. NMI depends on the diversity of the whole clustering result, whereas CA only measures the portion of dominantly matched items. In our experiments, ONMTF showed lower NMI on some datasets, which means that the items not corresponding to the dominantly matched cluster were distributed arbitrarily among the other clusters.

Orthogonal NMF potentially brings better CA on document clustering problems than NMF, but the two algorithms, ONM3F and ONMTF, showed complementary performance depending on the dataset. Moreover, the NMI results of ONMTF are sometimes poorer than those of the algorithms not using orthogonality. From this observation, we suggest an ensemble of ONM3F and ONMTF to improve both CA and NMI. We used the cluster-based similarity partitioning algorithm (CSPA) (Strehl & Ghosh, 2002) to compute an ensemble clustering solution for each matrix factorization method. This is carried out as follows. Let $L \in \mathbb{R}^{N \times K}$ be the cluster label matrix determined by one of the matrix factorization methods, where $[L]_{jk} = p(c_k \mid d_j)$. We compute a similarity matrix $W^{(i)}$ between documents, for trial $i$, as

$$W^{(i)} = L^{(i)} [L^{(i)}]^\top.$$

Treating $W = \sum_i W^{(i)}$ as a weighted affinity matrix on an undirected graph whose vertices are documents, we apply a graph partitioning technique to determine an ensemble clustering solution. The CA and NMI of the ensemble clusterings are summarized in Tables 5 and 6, respectively. We tested five different ensemble settings: (1) an ensemble of 30 NMFs; (2) an ensemble of 30 NBVDs; (3) an ensemble of 30 ONM3Fs; (4) an ensemble of 30 ONMTFs; and (5) an ensemble of 15 ONM3Fs + 15 ONMTFs. From the 100 clustering results obtained from each algorithm, we randomly generated 100 subsets consisting of 30 results each to use in the ensemble clustering, and the resulting CA and NMI values were averaged over the subsets. The ensemble of ONM3Fs + ONMTFs showed better CA than the ensemble of NMFs across all six datasets. Moreover, the ensemble of ONM3Fs + ONMTFs showed better NMI than the ensemble of NMFs, except on the k1b dataset. The ensemble of the two orthogonality-based algorithms thus brought better performance than the ensemble of NMFs for both the CA and NMI measures.

We also compare the two orthogonal methods, ONM3F and ONMTF, in terms of an orthogonality-preserving measure, computed as $\|U^\top U - I\|$ and $\|V^\top V - I\|$ for $U$ and $V$, respectively. We measured these quantities as a function of the number of iterations, averaged over 100 trials with different initial conditions. Fig. 4 shows that our algorithm ONMTF preserves the orthogonality of the factor matrices $U$ and $V$ better than ONM3F, since ours directly uses the true gradient information on the Stiefel manifold, where the orthogonality is preserved.
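As an illustration of the CSPA step above, the affinity accumulation $W = \sum_i W^{(i)}$ takes a few lines of NumPy; the final graph partitioning (e.g., by a spectral method) is intentionally left out of this sketch:

```python
import numpy as np

def cspa_affinity(label_matrices):
    """Accumulate W = sum_i L^(i) [L^(i)]^T over trials.

    label_matrices: list of (N, K) arrays with [L]_jk ~ p(c_k | d_j).
    The summed W is then handed to a graph partitioner for the ensemble solution.
    """
    N = label_matrices[0].shape[0]
    W = np.zeros((N, N))
    for L in label_matrices:
        W += L @ L.T                      # W^(i) = L^(i) [L^(i)]^T
    return W
```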

Table 5. Clustering accuracies (CA) for ensemble clustering in five settings: (1) an ensemble of 30 NMFs; (2) an ensemble of 30 NBVDs; (3) an ensemble of 30 ONM3Fs; (4) an ensemble of 30 ONMTFs; (5) an ensemble of 15 ONM3Fs + 15 ONMTFs. Results significantly better than the ensemble of NMFs are marked with *. Statistical significance is measured using the Wilcoxon rank-sum test with p-value 0.01.

Dataset   NMFs     NBVDs     ONM3Fs    ONMTFs    ONM3F+ONMTF
CSTR      0.5846   0.5847    0.7875*   0.5830    0.6422*
k1a       0.5223   0.5417*   0.5366*   0.5409*   0.5489*
k1b       0.8473   0.8401    0.7238    0.8467    0.8474
re0       0.3444   0.3595*   0.3805*   0.3677*   0.3811*
re1       0.5232   0.5129    0.4740    0.5925*   0.5623*
wap       0.5169   0.5092    0.5396*   0.5467*   0.5497*

Table 6. Normalized mutual information (NMI) for ensemble clustering in five settings: (1) an ensemble of 30 NMFs; (2) an ensemble of 30 NBVDs; (3) an ensemble of 30 ONM3Fs; (4) an ensemble of 30 ONMTFs; (5) an ensemble of 15 ONM3Fs + 15 ONMTFs. Results significantly better than the ensemble of NMFs are marked with *. Statistical significance is measured using the Wilcoxon rank-sum test with p-value 0.01.

Dataset   NMFs     NBVDs     ONM3Fs    ONMTFs    ONM3F+ONMTF
CSTR      0.2972   0.2996*   0.6579*   0.2971    0.4026*
k1a       0.5485   0.5696*   0.5595*   0.5460    0.5665*
k1b       0.7375   0.7167    0.5663    0.6890    0.7101
re0       0.2991   0.3207*   0.3141*   0.3151*   0.3147*
re1       0.5344   0.5440*   0.5388*   0.5655*   0.5757*
wap       0.5472   0.5536*   0.5604*   0.5584*   0.5779*


Fig. 4. The orthogonality measure for $U$ and $V$, computed as (a) $\|U^\top U - I\|$ and (b) $\|V^\top V - I\|$, as the iterations proceed; each panel compares ONM3F and ONMTF.

Table 7. The factor matrix $S$ in the middle of the 3-factor decomposition $USV^\top$ determined by ONMTF, shown in its transposed form. Six term-clusters ($L = 6$) and 4 document-clusters ($K = 4$) are denoted by TC1-TC6 and DC1-DC4. Each of TC1-TC4 is associated with one of the document-clusters. TC5 and TC6 are each associated with two document-clusters; in other words, TC5 and TC6 contain polysemous words.

               TC1      TC2      TC3      TC4      TC5      TC6
DC1: Theory    0.2355   0.0000   0.0053   0.0320   0.0467   0.7539
DC2: Systems   0.0000   0.4954   0.0079   0.0255   1.3484   0.0437
DC3: Robotics  0.0000   0.0013   0.1284   0.0126   0.0291   0.0097
DC4: AI        0.0000   0.0000   0.0296   1.8626   0.3912   0.1035

Table 8. The ten most probable words in each term-cluster are summarized, when ONMTF is applied.

TC1 (theory)   TC2 (systems)   TC3 (robotics)   TC4 (AI)    TC5 (AI + Systems)   TC6 (AI + theory)
Truth          Compiler        Clustering       Vision      Technique            Complexity
Proved         Optimization    Extremely        State       Implementation       Length
Loly           Wait            Make             Knowledge   Coherence            Number
np             Busy            Rule             Speech      Programming          Turing
Pspace         Latency         Discovery        Language    Results              Problem
Unambiguous    Balancing       Previous         Approach    Network              Log
Boolean        Block           Pattern          Feature     Provide              Property
Reducible      Kernel          Traversal        Human       Communication        Probabilistic
Polynomial     Protocol        Cluster          Tasks       Scale                Function
Imply          Access          Repeated         World       Algorithms           Bound

4.3. Co-clustering and polysemous word identification

Co-clustering simultaneously determines term-clusters as well as document-clusters, allowing us to identify how term-clusters are related to document-clusters. The most probable words in the term-clusters reveal topics as well. A word that refers to several different meanings depending on context is called a polysemous word. For example, the word data used with the words memory, software, and cache represents a part of a computer system. However, if the word data is used with the words learning, recognition, and visual, then it most likely refers to the training data used in computer vision or robotics systems.

In one-sided clustering such as NMF or PLSI, we can determine the context of a document-cluster by examining the corresponding topic factors (those that have the highest probability of generating a specific word under consideration) (Hofmann, 1999a). A word appearing in several topic factors (column vectors of $U$ in the 2-factor decomposition) is a polysemous word, and the corresponding document-clusters indicate its different contexts. However, this examination requires additional processing, because one-sided clustering does not provide term-clusters.

In co-clustering with ONMTF, we set $L > K$, i.e., we choose a larger number of term-clusters than document-clusters, so that polysemous words may appear in separate term-clusters. The structure of the factor matrix $S$ then leads us to identify term-clusters involving polysemous words. Table 7 shows an exemplary $S$ determined by ONMTF on the CSTR dataset, with $L = 6$ and $K = 4$. Term-clusters 1, 2, 3, and 4 (denoted TC1 through TC4) are each associated with one of the document-clusters (denoted DC1 through DC4). On the other hand, TC5 involves DC2 and DC4, related to the topics Systems and AI, while TC6 is associated with DC1 and DC4, related to the topics Theory and AI. The ten most probable words in each term-cluster are summarized in Table 8. TC1 through TC4 contain words representing the topics 'Theory', 'Systems', 'Robotics', and 'AI', respectively, while TC5 and TC6 contain words related to 'AI + Systems' and 'AI + Theory'.


tems’ and ‘AI+Theory’, respectively. For example, the words in TC5, such as network and communication can be used in both DC1 (AI) and DC4 (Systems). Thus, the examination of S enables us to easily detect polysemous words and corresponding contexts, which is an advantage of ONMTF over NMF. 5. Conclusions We have presented multiplicative updates for orthogonal nonnegative matrix tri-factorization, referred to as ONMTF, applying it to a task of co-clustering words and documents. We have developed our ONMTF algorithm, directly exploiting true gradient information on Stiefel manifolds. Thus, our algorithm preserved orthogonality better than an existing tri-factorization algorithm ONM3F (Ding et al., 2006). Experiments on 6 document datasets, matrix tri-factorization indeed improved the performance of document clustering in terms of clustering accuracy. We further improved the clustering performance, in terms of clustering accuracy and normalized mutual information, by constructing ensembles of homogeneous (e.g., ONMTFs) or heterogeneous (e.g., ONM3Fs+ONMTFs) matrix factorizations. We also confirmed that co-clustering with ONMTF was useful in identifying polysemous words, which was done by examining the structure of the middle factor matrix S in the 3-factor decomposition USV > . The extension of the proposed multiplicative algorithm based on the projected gradient approaches (Zdunek & Cichocki, 2008) will be an interesting future work for dealing with large-scale problems. Acknowledgments This work was supported by Korea Research Foundation Grant (KRF-2008-313-D00939), Korea Ministry of Knowledge Economy under the ITRC support program supervised by the National IT Industry Promotion Agency (NIPA-2009-C10900902-0045), WCU Program (Project No. R31-2008-000-10100-0), and Microsoft Research Asia. References Choi, S. (2008). Algorithms for orthogonal nonnegative matrix factorization. In Proceedings of the international joint conference on neural networks (IJCNN). Hong Kong. Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-clustering. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (pp. 89–98). Ding, C., He, X., & Simon, H. D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the SIAM international conference on data mining (SDM) (pp. 606–610). Newport Beach, CA. Ding, C., Li, T., Peng, W., & Park, H. (2006). Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). Philadelphia, PA. Ding, C., Li, T., & Peng, W. (2008). On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics and Data Analysis, 52, 3913–3927. Edelman, A., Arias, T., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal of Matrix Analysis and Application, 20(2), 303–353. Gaussier, E., & Goutte, C. (2005). Relation between PLSA and NMF and implications. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (SIGIR). Salvador, Brazil. Han, E. H., Boley, D., Gini, M., Gross, R., Hastings, K., & Karypis, et al. (1998). WebACE: A web agent for document categorization and exploration. In Proceedings of the Second International Conference on Autonomous Agents (pp. 408–415). Minneapolis, Minnesota. Hofmann, T. (1999a). Probablistic latent semantic analysis. 
In Proceedings of the annual conference on uncertainty in artificial intelligence (UAI). Hofmann, T. (1999b). Probablistic latent semantic indexing. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (SIGIR). Hofmann, T., Puzicha, J., & Jordan, M. I. (1999). Learning from dyadic data. Advances in neural information processing systems NIPS (Vol. 11). MIT Press. Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791. Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. Advances in neural information processing systems (NIPS) (Vol. 13). MIT Press. Li, T., & Ding, C. (2006). The relationships among various nonnegative matrix factorization methods for clustering. In Proceedings of the IEEE international conference on data mining (ICDM). Hong Kong. Long, B., Zhang, Z., & Yu, P.S. (2005). Co-clustering by block value decomposition. In Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD). Chicago, IL. Lovasz, L., & Plummer, M. (1986). Matching theory. Akademiai Kiado. Shahnaz, F., Berry, M., Pauca, P., & Plemmons, R. (2006). Document clustering using nonnegative matrix factorization. Information Processing and Management, 42, 373–386. Smith, S.T. (1993). Geometric optimization methods for adaptive filtering. Ph.D. thesis, Harvard University. Stiefel, E. (1935–1936). Richtungsfelder und fernparallelismus in n-dimensionalem mannig faltigkeiten. Commentarii Mathematici Helvetici, 8, 305–353. Strehl, A., & Ghosh, J. (2002). Cluster ensembles – a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617. Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). Toronto, Canada. Yoo, J., & Choi, S. (2008). Orthogonal nonnegative matrix factorization: Multiplicative updates on Stiefel manifolds. In Proceedings of the 9th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL). Daejeon, Korea. Zdunek, R., & Cichocki, A. (2008). Fast nonnegative matrix factorization algorithms using projected gradient approaches for large-scale problems. Computational Intelligence and Neuroscience, 2008.