Penalized nonnegative matrix tri-factorization for co-clustering

Expert Systems With Applications 78 (2017) 64–73

Shiping Wang a,b, Aiping Huang c,∗

a College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, China
b Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116, China
c Tan Kah Kee College, Xiamen University, Zhangzhou 363105, China

Article history: Received 11 November 2016; Revised 22 January 2017; Accepted 23 January 2017; Available online 9 February 2017

Keywords: Machine learning; Co-clustering; Unsupervised learning; Nonnegative matrix factorization

Abstract

Nonnegative matrix factorization has been widely used in co-clustering tasks, which group data points and features simultaneously. In recent years, several proposed co-clustering algorithms have shown their superiority over traditional one-side clustering, especially in text clustering and gene expression. Due to the NP-completeness of co-clustering problems, most existing methods relax the orthogonality constraint to nonnegativity, which often deteriorates performance and robustness as a result. In this paper, penalized nonnegative matrix tri-factorization is proposed for co-clustering problems, where three penalty terms are introduced to guarantee the near orthogonality of the clustering indicator matrices. An iterative updating algorithm is proposed and its convergence is proved. Furthermore, a high-order nonnegative matrix tri-factorization technique is provided for symmetric co-clustering tasks, and a corresponding algorithm with proved convergence is also developed. Finally, extensive experiments on six real-world datasets demonstrate that the proposed algorithms outperform the compared state-of-the-art co-clustering methods.

1. Introduction

Clustering is one of the most important research topics in unsupervised machine learning. It is the problem of categorizing unlabeled data points into groups based on their similarities, as evaluated by certain metrics. Due to its efficiency and fast convergence, clustering has been widely used in data mining, image processing, computer vision and biological engineering. Clustering algorithms can be divided into constraint-based methods and metric-based methods. The former adds certain constraints to fundamental clustering algorithms in order to capture other vital aspects, such as fuzzy K-means (Dunn, 1973; Pedrycz & Waletzky, 1997; Prasad, Lin, Lin, Er, & Prasad, 2015), sparse K-means (Kondo, Matias, & Zamar, 2012), sparse subspace clustering (Elhamifar & Vidal, 2013), sparse fuzzy K-means (Qiu, Qiu, Feng, & Li, 2015) and scalable sparse subspace clustering (Peng, Zhang, & Yi, 2013). The latter learns effective metrics, not limited to matrix norms, to evaluate sample similarities, such as kernel K-means (Dhillon, Guan, & Kulis, 2004; Yu, Tranchevent, Liu, & Glanzel, 2012).



Corresponding author. E-mail addresses: [email protected], [email protected] (A. Huang).

http://dx.doi.org/10.1016/j.eswa.2017.01.019

A large number of clustering algorithms have been proposed up to now. Some typical representatives are K-means (Duda, Hart, & Stork, 2000), spectral clustering (Izquierdo-Verdiguier, Jenssen, Gmez-Chova, & Camps-Valls, 2015; Ng, Jordan, & Weiss, 2001) and density-based clustering (Beauchemin, 2015; Huang, Zhang, Song, & Chen, 2015a). It is noted that density-based clustering is a family of methods that cluster data points via probability distribution estimation, such as the Gaussian mixture model (He, Cai, Shao, Bao, & Han, 2011). In recent years, nonnegative matrix factorization (Wang, Pedrycz, Zhu, & Zhu, 2015; Wang & Zhu, 2016) has received considerable attention in machine learning and computer vision. It provides numerous problem formulation techniques and algorithmic methods for clustering problems, especially constraint-based clustering. For example, Ding et al. proved that K-means and spectral clustering can be expressed as certain forms of nonnegative matrix factorization (Ding, He, & Simon, 2005; Li & Ding, 2006).

However, the techniques mentioned above consider only one-side clustering, which groups similar objects. Co-clustering can categorize data points and features simultaneously, and it has an extensive range of applications (Papalexakis, Sidiropoulos, & Bro, 2012). For example, in text mining, co-clustering can cluster similar documents and topics in two-way modes (Rege, Dong, & Fotouhi, 2006). In recommendation systems, co-clustering can divide a user-movie input matrix into user clusters and movie


groups simultaneously (Konstas, Stathopoulos, & Jose, 2009). Furthermore, co-clustering has also been widely applied to gene expression (Brameier & Wiuf, 2007; Liu, Gu, Hou, Han, & Ma, 2014), identification of interaction networks (Luo, Liu, Cao, & Wang, 2016; Pio, Ceci, Loglisci, D'Elia, & Malerba, 2012), collaborative filtering (George & Merugu, 2005; Khoshneshin & Street, 2010), image processing (Chen, Dong, & Wan, 2009; Hanmandlu, Verma, Susan, & Madasu, 2013; Vitaladevuni & Basri, 2010) and social data mining (Bao, Min, Lu, & Xu, 2013; Giannakidou, Koutsonikola, Vakali, & Kompatsiaris, 2008). However, due to the orthogonality constraints of co-clustering problems, developing efficient algorithms remains a tough task (Pompili, Gillis, Absil, & Glineur, 2014). Though there have been some research works on co-clustering algorithms, most existing works (Buono & Pio, 2015; Huang, Wang, Li, Yang, & Li, 2015b; Liu, Li, Lin, Wu, & Ji, 2015; Wang, Nie, Huang, & Ding, 2011) relax the orthogonality constraints to nonnegativity, which may deteriorate the clustering performance. There are also some works that address orthogonality constraints using approximate strategies. For example, Ding, Li, Peng, and Park (2006) proposed a method to guarantee near orthogonality by choosing suitable Lagrangian multipliers. However, it often falls into expectation-maximization-style solutions, which do not guarantee convergence. Therefore, this motivates us to address general co-clustering problems at an algorithmic level.

In this paper, the co-clustering problem is formulated as a penalized nonnegative matrix tri-factorization problem, and two efficient iterative algorithms with proved convergence are proposed. First, co-clustering is expressed as matrix tri-factorization with dual orthogonality constraints, in which the two-way clustering results are indicated by two indicator matrices. These two matrices are constrained with orthogonality and nonnegativity, which increases the difficulty of optimizing the problem. Second, penalty terms are presented to approximately deal with the high-order orthogonality constraints. Using this technique, co-clustering is formulated as quadratic nonnegative matrix factorization and the corresponding efficient iterative algorithm is designed. Third, a symmetric penalized nonnegative matrix tri-factorization is also defined. Simultaneously, an important technique is proposed to deal with high-order nonnegative matrix factorization. Incorporating this technique and the penalty terms, we put forward an efficient co-clustering algorithm. Finally, the proposed algorithms are compared with five state-of-the-art clustering methods. Experimental results demonstrate the superiority of the proposed algorithms on six real-world datasets from machine learning repositories.

The rest of this paper is arranged as follows. In Section 2, we formulate the co-clustering problem as penalized nonnegative matrix tri-factorization and develop an efficient iterative algorithm with proved convergence. In Section 3, we use symmetric penalized nonnegative matrix tri-factorization to represent clustering problems, and present an important technique to address high-order nonnegative matrix factorization problems. In Section 4, experimental results and analyses are provided. Finally, this paper is concluded in Section 5.

2. Penalized nonnegative matrix tri-factorization for co-clustering

In this section, co-clustering problems are formulated as penalized nonnegative matrix factorization, an efficient iterative algorithm in a batch mode is proposed, and its convergence is proved.

2.1. Problem formulation

Given the data matrix $X \in \mathbb{R}^{d\times n}$, where $d$ is the number of features and $n$ is the number of samples, denote the sample space as $X = \{x_1, \ldots, x_n\}$ where $x_i \in \mathbb{R}^d$, and the feature space as


$F = \{f_1, \ldots, f_d\}$ where $f_j \in \mathbb{R}^n$. It is noted that

$$X = (x_1, \ldots, x_n) = \begin{pmatrix} f_1 \\ \vdots \\ f_d \end{pmatrix}.$$

The clustering problem is to partition the sample space into $c$ clusters $\{C_i\}_{i=1}^{c}$, whereas the co-clustering problem is to group the sample space $\{x_1, \ldots, x_n\}$ into $c_1$ clusters $\{C_i\}_{i=1}^{c_1}$ as well as the feature space $\{f_1, \ldots, f_d\}$ into $c_2$ clusters $\{D_j\}_{j=1}^{c_2}$. In order to transform discrete K-means optimization problems into continuous ones, a clustering matrix $G \in \mathbb{R}^{n\times c}$ is introduced to represent the clustering results, where $G_{ij} = \frac{1}{\sqrt{|C_j|}}$ if sample $x_i$ belongs to cluster $C_j$ and $G_{ij} = 0$ otherwise (Ding et al., 2005). It is easily verified that $G^TG = I$, where $I \in \mathbb{R}^{c\times c}$ is the identity matrix. Using the clustering matrix, the clustering problem is approximately formulated as the following optimization problem:

$$\min_{S,G} \frac{1}{2}\|X - SG^T\|_F^2 \quad \text{s.t.}\; G \ge 0,\; G^TG = I \tag{1}$$

where $S \in \mathbb{R}^{d\times c}$ is a coefficient matrix and $G \in \mathbb{R}^{n\times c}$ is a clustering matrix. Analogously, the co-clustering problem is written as the following expression:

$$\min_{F,S,G} \frac{1}{2}\|X - FSG^T\|_F^2 \quad \text{s.t.}\; F \ge 0,\; G \ge 0,\; F^TF = I,\; G^TG = I \tag{2}$$

where $F \in \mathbb{R}^{d\times c_1}$ and $G \in \mathbb{R}^{n\times c_2}$ are indicator matrices and $S \in \mathbb{R}^{c_1\times c_2}$ is a coefficient matrix. Here, $\|\cdot\|_F$ represents the Frobenius norm, i.e., for any matrix $X = (X_{ij})_{d\times n} \in \mathbb{R}^{d\times n}$,

$$\|X\|_F = \left(\sum_{j=1}^{n}\sum_{i=1}^{d}X_{ij}^2\right)^{\frac{1}{2}}. \tag{3}$$
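To make the notation concrete, the following minimal NumPy sketch (ours, not taken from the paper) evaluates the tri-factorization fit term of Eq. (2) together with the Frobenius norm of Eq. (3) for randomly generated nonnegative matrices; all sizes and variable names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, n, c1, c2 = 20, 50, 3, 4                        # illustrative sizes
X = rng.random((d, n))                             # data matrix, d features x n samples
F = rng.random((d, c1))                            # indicator matrix F in Eq. (2)
S = rng.random((c1, c2))                           # coefficient matrix S
G = rng.random((n, c2))                            # indicator matrix G

residual = X - F @ S @ G.T                         # X - F S G^T
fit = 0.5 * np.linalg.norm(residual, "fro") ** 2   # 0.5 * ||X - F S G^T||_F^2, using Eq. (3)
print(fit)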

2.2. Penalized nonnegative matrix tri-factorization

It is known that K-means is NP-hard; however, the Lloyd–Max iteration (Lloyd, 1982) usually yields satisfactory solutions at an affordable complexity. The co-clustering problem is a generalization of K-means, hence it is also NP-hard. Unlike K-means, there is no algorithm analogous to the Lloyd–Max iteration for co-clustering. In order to alleviate the difficulty of the co-clustering problem, we introduce three penalty terms to replace the two orthogonality constraints $G^TG = I$ and $F^TF = I$. The co-clustering problem is transformed into the following optimization objective function:

$$\min_{F,S,G} \frac{1}{2}\|X - FSG^T\|_F^2 + \frac{\alpha}{2}\mathrm{Tr}(F\Lambda F^T) + \frac{\beta}{2}\mathrm{Tr}(G\Theta G^T) + \frac{\gamma}{2}\mathrm{Tr}(S^TS) \quad \text{s.t.}\; F \ge 0,\; G \ge 0 \tag{4}$$

where $\Lambda \in \mathbb{R}^{c_1\times c_1}$ and $\Theta \in \mathbb{R}^{c_2\times c_2}$ are two penalty matrices that guarantee the orthogonality (or near orthogonality) of $F$ and $G$, $\mathrm{Tr}(S^TS)$ keeps $F$ and $G$ from becoming too small, and $\alpha$, $\beta$, $\gamma$ are used to balance the three terms. Denote $F = (F_1, \ldots, F_{c_1})$ where $F_i \in \mathbb{R}^d$. We specify the penalty matrix as

$$\Lambda = \begin{pmatrix} 0 & 1 & \cdots & 1 \\ 1 & 0 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 0 \end{pmatrix} \tag{5}$$

and $\Theta$ is defined analogously. It is observed that

$$\mathrm{Tr}(F\Lambda F^T) = \mathrm{Tr}(\Lambda F^TF) = \sum_{i\ne j}F_i^TF_j, \tag{6}$$


which is minimized by an orthogonal vector group $\{F_1, \ldots, F_{c_1}\}$ provided that no column of $F$ is equal to zero. This non-zero condition can be adjusted by the term $\mathrm{Tr}(S^TS)$.
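As a quick numerical illustration of Eqs. (5) and (6), the sketch below (our code; the name Lam for the penalty matrix is our own choice) builds the zero-diagonal, all-ones penalty matrix and checks that Tr(F Λ F^T) equals the sum of inner products between distinct columns of F, which is why driving this term to zero pushes the columns of F towards mutual orthogonality.

import numpy as np

c1, d = 4, 10
Lam = np.ones((c1, c1)) - np.eye(c1)        # penalty matrix of Eq. (5): zeros on the diagonal, ones elsewhere

rng = np.random.default_rng(1)
F = rng.random((d, c1))                     # a nonnegative d x c1 matrix

lhs = np.trace(F @ Lam @ F.T)               # Tr(F Lam F^T)
rhs = sum(F[:, i] @ F[:, j]                 # sum of F_i^T F_j over i != j, Eq. (6)
          for i in range(c1) for j in range(c1) if i != j)
print(np.isclose(lhs, rhs))                 # True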

2.3. Algorithm for co-clustering

In this subsection, we use the Lagrange multiplier method to solve the penalized nonnegative matrix tri-factorization. Considering the fact that the input data matrix, such as image data, is often nonnegative, we add a nonnegativity constraint on the coefficient matrix $S$; the penalized nonnegative matrix tri-factorization is then rewritten as follows:

$$\min_{F,S,G} \frac{1}{2}\|X - FSG^T\|_F^2 + \frac{\alpha}{2}\mathrm{Tr}(F\Lambda F^T) + \frac{\beta}{2}\mathrm{Tr}(G\Theta G^T) + \frac{\gamma}{2}\mathrm{Tr}(S^TS) \quad \text{s.t.}\; F \ge 0,\; S \ge 0,\; G \ge 0. \tag{7}$$

Three Lagrange multipliers $\lambda \in \mathbb{R}^{d\times c_1}$, $\mu \in \mathbb{R}^{c_1\times c_2}$ and $\nu \in \mathbb{R}^{n\times c_2}$ are employed for the constraints $F \ge 0$, $S \ge 0$ and $G \ge 0$, respectively. The Lagrange function $L$ is constructed as

$$L(F,S,G,\lambda,\mu,\nu) = \frac{1}{2}\|X - FSG^T\|_F^2 + \frac{\alpha}{2}\mathrm{Tr}(F\Lambda F^T) + \frac{\beta}{2}\mathrm{Tr}(G\Theta G^T) + \frac{\gamma}{2}\mathrm{Tr}(S^TS) + \mathrm{Tr}(\lambda F^T) + \mathrm{Tr}(\mu S^T) + \mathrm{Tr}(\nu G^T). \tag{8}$$

Denote $\mathrm{Tr}(A) = \sum_{i=1}^{n}A_{ii}$ for any square matrix $A = (A_{ij})_{n\times n} \in \mathbb{R}^{n\times n}$. Using the matrix properties $\mathrm{Tr}(A) = \mathrm{Tr}(A^T)$ and $\|A\|_F^2 = \mathrm{Tr}(A^TA)$, we observe

$$\|X - FSG^T\|_F^2 = \mathrm{Tr}((X - FSG^T)^T(X - FSG^T)) = \mathrm{Tr}(X^TX) - 2\mathrm{Tr}(X^TFSG^T) + \mathrm{Tr}(GS^TF^TFSG^T). \tag{9}$$

Taking derivatives of $L$ with respect to $F$, $S$ and $G$, we have

$$\frac{\partial L}{\partial F} = -XGS^T + FSG^TGS^T + \alpha F\Lambda + \lambda \tag{10}$$

$$\frac{\partial L}{\partial S} = -F^TXG + F^TFSG^TG + \gamma S + \mu \tag{11}$$

$$\frac{\partial L}{\partial G} = -X^TFS + GS^TF^TFS + \beta G\Theta + \nu. \tag{12}$$

The Karush–Kuhn–Tucker (KKT) conditions give the first-order necessary conditions for nonlinear optimization problems with inequality constraints. Using the KKT conditions, we know $\lambda_{ij}F_{ij} = 0$, $\mu_{ij}S_{ij} = 0$ and $\nu_{ij}G_{ij} = 0$, which imply the following three equations:

$$(-XGS^T + FSG^TGS^T + \alpha F\Lambda)_{ij}F_{ij} = 0 \tag{13}$$

$$(-F^TXG + F^TFSG^TG + \gamma S)_{ij}S_{ij} = 0 \tag{14}$$

$$(-X^TFS + GS^TF^TFS + \beta G\Theta)_{ij}G_{ij} = 0. \tag{15}$$

Problem (7) is convex in $F$, $S$ or $G$ alone, but it is not jointly convex in the three variables. Therefore, it is solved in the sense of local minima by optimizing each variable successively. Borrowing the framework of multiplicative updating rules for nonnegative quadratic problems (Lee & Seung, 2001), the above three equations lead to the following three rules:

$$F_{ij} \leftarrow F_{ij}\left[\frac{(XGS^T)_{ij}}{(FSG^TGS^T + \alpha F\Lambda)_{ij}}\right]^{\frac{1}{2}} \tag{16}$$

$$S_{ij} \leftarrow S_{ij}\left[\frac{(F^TXG)_{ij}}{(F^TFSG^TG + \gamma S)_{ij}}\right]^{\frac{1}{2}} \tag{17}$$

$$G_{ij} \leftarrow G_{ij}\left[\frac{(X^TFS)_{ij}}{(GS^TF^TFS + \beta G\Theta)_{ij}}\right]^{\frac{1}{2}}. \tag{18}$$

Based on the above analyses, an algorithm for co-clustering via penalized nonnegative matrix tri-factorization is summarized in Algorithm 1.

Algorithm 1 Penalized nonnegative matrix tri-factorization for co-clustering (PNMT).
Input: Data matrix $X \in \mathbb{R}^{d\times n}$, the numbers of clusters $c_1$, $c_2$, and parameters $\Lambda$, $\Theta$, $\alpha$, $\beta$ and $\gamma$.
Output: Two sets of clustering labels $\{C_i\}_{i=1}^{c_1}$ and $\{D_j\}_{j=1}^{c_2}$.
1: Initialize $F$, $S$ and $G$;
2: while not converged do
3:   Fix $S$, $G$ and update $F$ by $F_{ij} \leftarrow F_{ij}[(XGS^T)_{ij}/(FSG^TGS^T + \alpha F\Lambda)_{ij}]^{1/2}$;
4:   Fix $F$, $G$ and update $S$ by $S_{ij} \leftarrow S_{ij}[(F^TXG)_{ij}/(F^TFSG^TG + \gamma S)_{ij}]^{1/2}$;
5:   Fix $F$, $S$ and update $G$ by $G_{ij} \leftarrow G_{ij}[(X^TFS)_{ij}/(GS^TF^TFS + \beta G\Theta)_{ij}]^{1/2}$;
6: end while
7: Denote the feature space as $F = \{f_1, \ldots, f_d\}$ and the sample space as $X = \{x_1, \ldots, x_n\}$. Then $f_i \in C_j$ if $F_{ij} = \max_k F_{ik}$, and $x_i \in D_j$ if $G_{ij} = \max_k G_{ik}$.
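For readers who want to experiment with Algorithm 1, the following NumPy sketch implements the multiplicative updates of Eqs. (16)–(18) under the assumption that the penalty matrices are the all-ones-minus-identity matrices of Eq. (5); the function name, the small constant eps added to the denominators for numerical safety, and the fixed number of sweeps are our own choices rather than part of the paper.

import numpy as np

def pnmt(X, c1, c2, alpha=1.0, beta=1.0, gamma=1.0, n_iter=30, eps=1e-12, seed=0):
    """Sketch of Algorithm 1 (PNMT); X is a nonnegative d x n data matrix."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    F = rng.random((d, c1))
    S = rng.random((c1, c2))
    G = rng.random((n, c2))
    Lam = np.ones((c1, c1)) - np.eye(c1)      # penalty matrix for F, Eq. (5)
    Theta = np.ones((c2, c2)) - np.eye(c2)    # penalty matrix for G
    for _ in range(n_iter):
        # Eq. (16): update F with S and G fixed
        F *= np.sqrt((X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + alpha * F @ Lam + eps))
        # Eq. (17): update S with F and G fixed
        S *= np.sqrt((F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + gamma * S + eps))
        # Eq. (18): update G with F and S fixed
        G *= np.sqrt((X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + beta * G @ Theta + eps))
    # Step 7 of Algorithm 1: assign each feature and each sample to its largest indicator entry
    feature_labels = F.argmax(axis=1)
    sample_labels = G.argmax(axis=1)
    return F, S, G, feature_labels, sample_labels

Because each update only rescales the current nonnegative entries, nonnegativity of F, S and G is preserved automatically, which is the main appeal of the multiplicative scheme.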

2.4. Correctness and convergence analysis

In this subsection, the correctness and convergence of the aforementioned updating rules in Eqs. (16), (17) and (18) are proved. We start with the definition of the auxiliary function, which plays an essential role in the convergence proof of updating rules with multiple variables.

Definition 1 (Auxiliary function (Razaviyayn, Hong, & Luo, 2013)). A function $Z(h, h')$ is called an auxiliary function of $J(h)$ if it satisfies the following four conditions: for any $h$ and $h'$,

$$Z(h, h') \ge J(h), \quad Z(h, h) = J(h), \quad Z(h, h') \text{ is continuous in } (h, h'), \quad \frac{\partial Z(h, h')}{\partial h}\bigg|_{h=h'} = \frac{dJ(h)}{dh}\bigg|_{h=h'}. \tag{19}$$

It is noted that the auxiliary function in Lee and Seung (2001) only requires a tight upper bound of the objective function. Definition 1 adds two first-order conditions, which guarantee that the first-order derivative of $Z(h, h')$ is the same as that of $J(h)$. The following lemma provides a technique to prove the convergence of an algorithm with multiple variables.

Lemma 1 (Razaviyayn et al. (2013)). If $Z(h, h')$ is an auxiliary function of $J(h)$, then $J(h)$ is non-increasing under the updating rule

$$h^{(t+1)} = \arg\min_{h} Z(h, h^{(t)}). \tag{20}$$

Lemma 2 (Ding, Li, & Jordan (2010)). For any matrices $A \in \mathbb{R}_+^{n\times n}$, $B \in \mathbb{R}_+^{k\times k}$, $S \in \mathbb{R}_+^{n\times k}$, $S' \in \mathbb{R}_+^{n\times k}$, where $A$ and $B$ are symmetric, the following inequality holds:

$$\sum_{i=1}^{n}\sum_{j=1}^{k}\frac{(AS'B)_{ij}S_{ij}^2}{S'_{ij}} \ge \mathrm{Tr}(S^TASB). \tag{21}$$

For simplicity, we denote the optimization objective function as $J(F,S,G) = \frac{1}{2}\|X - FSG^T\|_F^2 + \frac{\alpha}{2}\mathrm{Tr}(F\Lambda F^T) + \frac{\beta}{2}\mathrm{Tr}(G\Theta G^T) + \frac{\gamma}{2}\mathrm{Tr}(S^TS)$. The following theorem constructs an auxiliary function of the penalized nonnegative matrix tri-factorization problem with respect to the variable $F$ while fixing the variables $S$ and $G$.

Theorem 1. Denote

$$J(F) = -\mathrm{Tr}(X^TFSG^T) + \frac{1}{2}\mathrm{Tr}(GS^TF^TFSG^T) + \frac{\alpha}{2}\mathrm{Tr}(F\Lambda F^T). \tag{22}$$

The following function

$$Z(F, F') = -\sum_{j=1}^{c_1}\sum_{i=1}^{d}(XGS^T)_{ij}F'_{ij}\left(1 + \log\frac{F_{ij}}{F'_{ij}}\right) + \frac{1}{2}\sum_{j=1}^{c_1}\sum_{i=1}^{d}\frac{(F'SG^TGS^T)_{ij}F_{ij}^2}{F'_{ij}} + \frac{\alpha}{2}\sum_{j=1}^{c_1}\sum_{i=1}^{d}\frac{(F'\Lambda)_{ij}F_{ij}^2}{F'_{ij}} \tag{23}$$

is an auxiliary function of $J(F)$. Furthermore, $Z(F, F')$ is convex in $F$ and its global minimum is

$$F_{ij} = \arg\min_{F_{ij}} Z(F, F') = F'_{ij}\left[\frac{(XGS^T)_{ij}}{(F'SG^TGS^T + \alpha F'\Lambda)_{ij}}\right]^{\frac{1}{2}}. \tag{24}$$

Proof. See Appendix A.

Theorem 2. Denote

$$J(S) = -\mathrm{Tr}(X^TFSG^T) + \frac{1}{2}\mathrm{Tr}(GS^TF^TFSG^T) + \frac{\gamma}{2}\mathrm{Tr}(S^TS). \tag{25}$$

The following function

$$Z(S, S') = -\sum_{j=1}^{c_2}\sum_{i=1}^{c_1}(F^TXG)_{ij}S'_{ij}\left(1 + \log\frac{S_{ij}}{S'_{ij}}\right) + \frac{1}{2}\sum_{j=1}^{c_2}\sum_{i=1}^{c_1}\frac{(F^TFS'G^TG)_{ij}S_{ij}^2}{S'_{ij}} + \frac{\gamma}{2}\sum_{j=1}^{c_2}\sum_{i=1}^{c_1}S_{ij}^2 \tag{26}$$

is an auxiliary function of $J(S)$. Furthermore, $Z(S, S')$ is convex in $S$ and its global minimum is

$$S_{ij} = \arg\min_{S_{ij}} Z(S, S') = S'_{ij}\left[\frac{(F^TXG)_{ij}}{(F^TFS'G^TG + \gamma S')_{ij}}\right]^{\frac{1}{2}}. \tag{27}$$

Proof. See Appendix B.

Theorem 3. Denote

$$J(G) = -\mathrm{Tr}(X^TFSG^T) + \frac{1}{2}\mathrm{Tr}(GS^TF^TFSG^T) + \frac{\beta}{2}\mathrm{Tr}(G\Theta G^T). \tag{28}$$

The following function

$$Z(G, G') = -\sum_{j=1}^{c_2}\sum_{i=1}^{n}(X^TFS)_{ij}G'_{ij}\left(1 + \log\frac{G_{ij}}{G'_{ij}}\right) + \frac{1}{2}\sum_{j=1}^{c_2}\sum_{i=1}^{n}\frac{(G'S^TF^TFS)_{ij}G_{ij}^2}{G'_{ij}} + \frac{\beta}{2}\sum_{j=1}^{c_2}\sum_{i=1}^{n}\frac{(G'\Theta)_{ij}G_{ij}^2}{G'_{ij}} \tag{29}$$

is an auxiliary function of $J(G)$. Furthermore, $Z(G, G')$ is convex in $G$ and its global minimum is

$$G_{ij} = \arg\min_{G_{ij}} Z(G, G') = G'_{ij}\left[\frac{(X^TFS)_{ij}}{(G'S^TF^TFS + \beta G'\Theta)_{ij}}\right]^{\frac{1}{2}}. \tag{30}$$

Proof. See Appendix C.

Based on the above three theorems, we can prove the convergence of Algorithm 1.

Theorem 4. Because Algorithm 1 falls into the BSUM framework (Razaviyayn et al., 2013), every limit point of the iterates it generates is a stationary point of Problem (7).

Proof. The iterates generated by the BSUM algorithm converge to the set of stationary points (Razaviyayn et al., 2013), which implies the convergence.

3. Symmetric penalized nonnegative matrix tri-factorization for co-clustering

In this section, symmetric penalized nonnegative matrix tri-factorization is proposed for co-clustering problems. In particular, an important technique is presented to deal with high-order nonnegative matrix factorization, and the corresponding algorithm with proved convergence is developed.

3.1. Problem formulation for symmetric co-clustering

In some cases, the relational matrix of the sample space (or feature space) is easy to obtain, whereas the specific values of the samples are difficult to evaluate. Assuming that $R \in \mathbb{R}^{n\times n}$ is the relational matrix of the sample space $X = \{x_1, \ldots, x_n\}$, the goal of co-clustering is to group the sample space using the sample similarity matrix. The symmetric co-clustering problem is written as the following optimization objective function:

$$\min_{G,S} \frac{1}{2}\|R - GSG^T\|_F^2 \quad \text{s.t.}\; G \ge 0,\; G^TG = I. \tag{31}$$

We also use the symmetric penalized nonnegative matrix tri-factorization technique to deal with the above co-clustering problem. Hence the co-clustering problem is rewritten as follows:

$$\min_{G,S} \frac{1}{2}\|R - GSG^T\|_F^2 + \alpha\mathrm{Tr}(G\Lambda G^T) + \frac{\beta}{2}\mathrm{Tr}(S^TS) \quad \text{s.t.}\; G \ge 0,\; S \ge 0, \tag{32}$$

where $\Lambda \in \mathbb{R}^{c\times c}$ is the penalty matrix with zero diagonal and unit off-diagonal entries, as in Eq. (5).
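The formulation above takes the relational matrix R as given, and the paper does not prescribe how it should be constructed. Purely as an illustration, one common choice that yields a symmetric nonnegative R is a Gaussian-kernel similarity over the samples, sketched below (the function name and the bandwidth parameter sigma are our own assumptions).

import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Build a symmetric nonnegative n x n similarity matrix from a d x n data matrix X."""
    sq_norms = np.sum(X ** 2, axis=0)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)   # pairwise squared Euclidean distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))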

3.2. Algorithm

In this subsection, the Lagrange multiplier method is used to solve the symmetric penalized nonnegative matrix tri-factorization. Lagrange multipliers $\lambda \in \mathbb{R}^{n\times c}$ and $\mu \in \mathbb{R}^{c\times c}$ are introduced for the constraints $G \ge 0$ and $S \ge 0$. The Lagrange function $L$ is constructed as follows:

$$L(G,S,\lambda,\mu) = \frac{1}{2}\|R - GSG^T\|_F^2 + \alpha\mathrm{Tr}(G\Lambda G^T) + \frac{\beta}{2}\mathrm{Tr}(S^TS) + \mathrm{Tr}(\lambda G^T) + \mathrm{Tr}(\mu S^T). \tag{33}$$

Taking derivatives of $L$ with respect to $G$ and $S$, we have

$$\frac{\partial L}{\partial G} = -2RGS + 2GSG^TGS + 2\alpha G\Lambda + \lambda \tag{34}$$

$$\frac{\partial L}{\partial S} = -G^TRG + G^TGSG^TG + \beta S + \mu \tag{35}$$

where $S$ is regarded as a symmetric matrix. According to the Karush–Kuhn–Tucker (KKT) conditions, we have $\lambda_{ij}G_{ij} = 0$ and $\mu_{ij}S_{ij} = 0$, which lead to

$$(-2RGS + 2GSG^TGS + 2\alpha G\Lambda)_{ij}G_{ij} = 0 \tag{36}$$

$$(-G^TRG + G^TGSG^TG + \beta S)_{ij}S_{ij} = 0. \tag{37}$$

The above two equations result in the following two updating rules:

$$G_{ij} \leftarrow G_{ij}\left[\frac{(RGS)_{ij}}{(GSG^TGS + \alpha G\Lambda)_{ij}}\right]^{\frac{1}{4}} \tag{38}$$

$$S_{ij} \leftarrow S_{ij}\left[\frac{(G^TRG)_{ij}}{(G^TGSG^TG + \beta S)_{ij}}\right]^{\frac{1}{2}}. \tag{39}$$

Based on the above analyses, an algorithm for co-clustering based on symmetric penalized nonnegative matrix tri-factorization is summarized in Algorithm 2.

Algorithm 2 Symmetric penalized nonnegative matrix tri-factorization for co-clustering (SPNMT).
Input: Relational matrix $R \in \mathbb{R}^{n\times n}$, the number of clusters $c$, and parameters $\Lambda$, $\alpha$ and $\beta$.
Output: Clustering labels $\{C_i\}_{i=1}^{c}$.
1: Initialize $G$ and $S$;
2: while not converged do
3:   Fix $S$ and update $G$ by $G_{ij} \leftarrow G_{ij}[(RGS)_{ij}/(GSG^TGS + \alpha G\Lambda)_{ij}]^{1/4}$;
4:   Fix $G$ and update $S$ by $S_{ij} \leftarrow S_{ij}[(G^TRG)_{ij}/(G^TGSG^TG + \beta S)_{ij}]^{1/2}$;
5: end while
6: Denote the sample space as $X = \{x_1, \ldots, x_n\}$. Then $x_i \in C_j$ if $G_{ij} = \max_k G_{ik}$.
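Analogously, a minimal NumPy sketch of Algorithm 2 under the same assumptions (the penalty matrix Lam equals ones minus the identity, a small eps for numerical safety, a fixed number of sweeps) might look as follows; the explicit re-symmetrization of S is our own safeguard, reflecting the remark after Eq. (35) that S is regarded as symmetric.

import numpy as np

def spnmt(R, c, alpha=1.0, beta=1.0, n_iter=30, eps=1e-12, seed=0):
    """Sketch of Algorithm 2 (SPNMT); R is a nonnegative symmetric n x n relational matrix."""
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    G = rng.random((n, c))
    S = rng.random((c, c))
    S = (S + S.T) / 2.0                        # start from a symmetric S
    Lam = np.ones((c, c)) - np.eye(c)          # penalty matrix, as in Eq. (5)
    for _ in range(n_iter):
        # Eq. (38): fourth root, because the objective is quartic in G
        G *= ((R @ G @ S) / (G @ S @ G.T @ G @ S + alpha * G @ Lam + eps)) ** 0.25
        # Eq. (39)
        S *= np.sqrt((G.T @ R @ G) / (G.T @ G @ S @ G.T @ G + beta * S + eps))
        S = (S + S.T) / 2.0                    # keep S symmetric (our safeguard)
    labels = G.argmax(axis=1)                  # final assignment step of Algorithm 2
    return G, S, labels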

3.3. Correctness and convergence analysis

In this subsection, we verify the correctness and convergence of the symmetric penalized nonnegative matrix tri-factorization. First of all, a lemma is introduced to provide a tight upper bound for fourth-order matrix traces.

Lemma 3. For any matrices $A \in \mathbb{R}_+^{k\times k}$, $B \in \mathbb{R}_+^{k\times k}$ and $H \in \mathbb{R}_+^{n\times k}$, the following inequality holds: for any $H' \in \mathbb{R}_+^{n\times k}$,

$$\mathrm{Tr}(HAH^THBH^T) \le \sum_{j=1}^{k}\sum_{i=1}^{n}\frac{(H'AH'^TH'B + H'BH'^TH'A)_{ij}H_{ij}^4}{2H'^3_{ij}}. \tag{40}$$

Based on the above lemma, we construct auxiliary functions for each algorithmic variable.

Theorem 5. Denote

$$O(G) = -\mathrm{Tr}(RGSG^T) + \frac{1}{2}\mathrm{Tr}(GS^TG^TGSG^T) + \alpha\mathrm{Tr}(G\Lambda G^T). \tag{41}$$

The following function

$$Z(G, G') = -\sum_{l=1}^{n}\sum_{k=1}^{c}\sum_{j=1}^{c}\sum_{i=1}^{n}R_{il}S_{kj}G'_{lk}G'_{ij}\left(1 + \log\frac{G_{lk}G_{ij}}{G'_{lk}G'_{ij}}\right) + \frac{1}{2}\sum_{j=1}^{c}\sum_{i=1}^{n}\frac{(G'SG'^TG'S)_{ij}G_{ij}^4}{G'^3_{ij}} + \frac{\alpha}{2}\sum_{j=1}^{c}\sum_{i=1}^{n}\frac{(G'\Lambda)_{ij}(G_{ij}^4 + G'^4_{ij})}{G'^3_{ij}} \tag{42}$$

is an auxiliary function of $O(G)$. Furthermore, $Z(G, G')$ is convex in $G$ and its global minimum is

$$G_{ij} = \arg\min_{G_{ij}} Z(G, G') = G'_{ij}\left[\frac{(RG'S)_{ij}}{(G'SG'^TG'S + \alpha G'\Lambda)_{ij}}\right]^{\frac{1}{4}}. \tag{43}$$

Proof. See Appendix D.

Theorem 6. Denote

$$O(S) = -\mathrm{Tr}(RGSG^T) + \frac{1}{2}\mathrm{Tr}(GS^TG^TGSG^T) + \frac{\beta}{2}\mathrm{Tr}(S^TS). \tag{44}$$

Then the following function

$$Z(S, S') = -\sum_{j=1}^{c}\sum_{i=1}^{c}(G^TRG)_{ij}S'_{ij}\left(1 + \log\frac{S_{ij}}{S'_{ij}}\right) + \frac{1}{2}\sum_{j=1}^{c}\sum_{i=1}^{c}\left(\frac{(G^TGS'G^TG)_{ij}S_{ij}^2}{S'_{ij}} + \beta S_{ij}^2\right) \tag{45}$$

is an auxiliary function of $O(S)$. Furthermore, $Z$ is convex in $S$ and its global minimum is

$$S_{ij} = \arg\min_{S_{ij}} Z(S, S') = S'_{ij}\left[\frac{(G^TRG)_{ij}}{(G^TGS'G^TG + \beta S')_{ij}}\right]^{\frac{1}{2}}. \tag{46}$$

Proof. See Appendix E.

Theorem 7. Because Algorithm 2 falls into the BSUM framework (Razaviyayn et al., 2013), every limit point of the iterates it generates is a stationary point of Problem (32).

Proof. The iterates generated by the BSUM algorithm converge to the set of stationary points (Razaviyayn et al., 2013), which implies the convergence.

4. Experiments

In this section, comprehensive experiments on real-world datasets from machine learning repositories are conducted to validate the effectiveness and efficiency of the proposed algorithms for co-clustering problems.

4.1. Datasets

In our experiments, six publicly available machine learning datasets are used to verify the performance of the proposed algorithms. These datasets are derived from different sources, including object images, speech data, text documents and gene expression, which helps enrich the diversity of the experimental scenes. They provide a good test bed for a comprehensive performance evaluation of different clustering algorithms. The detailed information of the datasets is summarized in Table 1.

Table 1
Data description.

DID  Dataset    # instances  # features  # classes  Data type
1    COIL       1440         1024        20         Object image
2    ISOLET     1560         617         26         Speech database
3    BASEHOCK   1993         4862        2          Text document
4    CNAE       1080         856         9          Text document
5    PCMAC      1943         3289        2          Gene expression
6    TOX        171          5748        4          Gene expression

4.2. Experimental setting

In order to validate the effectiveness and efficiency of the proposed algorithms, the following five state-of-the-art clustering algorithms have been compared.

1. K-means: K-means clustering is one of the most widely used unsupervised learning methods. Its considerable advantage is that it converges to optimal values at a fast speed; however, it is quite sensitive to initial values and noisy data (Duda et al., 2000).
2. FCM: Fuzzy C-means (also called fuzzy K-means) is an extension of K-means. It assigns each sample a membership degree for every cluster, which implies that a data point may belong to two or more clusters, not just one (Pedrycz, 1993).
3. SNMF: Sparse nonnegative matrix factorization (SNMF) introduces sparsity constraints on the coefficient matrix and formulates the co-clustering problem with L1-norm regularization, which is optimized by alternating nonnegative least squares (Kim & Park, 2008).
4. DRCC: Dual regularized co-clustering (DRCC) formulates the co-clustering problem as semi-nonnegative matrix tri-factorization, which embeds the geometric structures of the data manifold and the feature manifold (Gu & Zhou, 2009).
5. DNMTF: Graph dual regularization non-negative matrix tri-factorization (DNMTF) uses nonnegative matrix tri-factorization to describe co-clustering problems and presents an iterative updating scheme (Shang, Jiao, & Wang, 2012).

In order to evaluate the clustering performance of the different clustering methods in a sound manner, two evaluation metrics are used: clustering accuracy (ACC) and normalized mutual information (NMI). Given a sample $x_i \in \{x_i\}_{i=1}^{n}$, denote $p_i$ and $q_i$ as the ground-truth label and the predicted clustering label, respectively. The ACC is defined as follows:

$$ACC = \frac{\sum_{i=1}^{n}\delta(p_i, map(q_i))}{n} \tag{47}$$

where $\delta(a, b) = 1$ if $a = b$ and $\delta(a, b) = 0$ otherwise. Here $map(\circ)$ is the best permutation mapping function that matches the obtained clustering labels to the ground-truth labels of the dataset. One of the best mapping functions can be found by the Kuhn–Munkres algorithm (Lovasz & Plummer, 2009). The higher the ACC is, the better the clustering performance is. Given two random variables $P$ and $Q$, the NMI of $P$ and $Q$ is defined as

$$NMI(P, Q) = \frac{I(P; Q)}{\sqrt{H(P)H(Q)}} \tag{48}$$

where $I(P; Q)$ is the mutual information of $P$ and $Q$, and $H(P)$ and $H(Q)$ are the entropies of $P$ and $Q$, respectively (Gray, 2011). Here, the clustering results $\tilde{C} = \{\tilde{C}_i\}_{i=1}^{\tilde{c}}$ and the ground-truth labels $C = \{C_j\}_{j=1}^{c}$ of all samples are viewed as two discrete random variables. NMI is specified as

$$NMI(C, \tilde{C}) = \frac{\sum_{i=1}^{\tilde{c}}\sum_{j=1}^{c}|\tilde{C}_i \cap C_j|\log\frac{n|\tilde{C}_i \cap C_j|}{|\tilde{C}_i||C_j|}}{\sqrt{\left(\sum_{i=1}^{\tilde{c}}|\tilde{C}_i|\log\frac{|\tilde{C}_i|}{n}\right)\left(\sum_{j=1}^{c}|C_j|\log\frac{|C_j|}{n}\right)}}. \tag{49}$$

It is noteworthy that $\tilde{c}$ and $c$ are not necessarily equal. The higher the NMI is, the better the clustering performance is.
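For completeness, both metrics can be computed as in the following NumPy/SciPy sketch (our code, not from the paper): the permutation map(·) of Eq. (47) is obtained with the Kuhn–Munkres algorithm via scipy.optimize.linear_sum_assignment, and NMI follows the geometric-mean normalization of Eqs. (48)–(49).

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC of Eq. (47): best one-to-one cluster-to-class mapping (Kuhn-Munkres)."""
    y_true = np.asarray(y_true); y_pred = np.asarray(y_pred)
    classes = np.unique(y_true); clusters = np.unique(y_pred)
    w = np.zeros((clusters.size, classes.size), dtype=np.int64)   # contingency table
    for i, k in enumerate(clusters):
        for j, c in enumerate(classes):
            w[i, j] = np.sum((y_pred == k) & (y_true == c))
    row, col = linear_sum_assignment(-w)                          # maximize matched samples
    return w[row, col].sum() / y_true.size

def nmi(y_true, y_pred):
    """NMI of Eqs. (48)-(49) with sqrt(H(P)H(Q)) normalization."""
    y_true = np.asarray(y_true); y_pred = np.asarray(y_pred)
    n = y_true.size
    classes = np.unique(y_true); clusters = np.unique(y_pred)
    pij = np.zeros((clusters.size, classes.size))
    for i, k in enumerate(clusters):
        for j, c in enumerate(classes):
            pij[i, j] = np.sum((y_pred == k) & (y_true == c)) / n
    pi = pij.sum(axis=1); qj = pij.sum(axis=0)
    nz = pij > 0
    mi = np.sum(pij[nz] * np.log(pij[nz] / np.outer(pi, qj)[nz]))
    h_p = -np.sum(pi[pi > 0] * np.log(pi[pi > 0]))
    h_q = -np.sum(qj[qj > 0] * np.log(qj[qj > 0]))
    return mi / np.sqrt(h_p * h_q)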

There are some parameters to be set in advance. Due to the sensitivity of most clustering algorithms to initial values, the experiments are repeated 20 times, and the means and standard deviations are reported. For the co-clustering algorithms, the number of clusters used to partition the samples is also used as the number of clusters to group the features. For all tested clustering algorithms, the number of clusters is set as the number of classes provided by the label information. For the compared clustering algorithms, their parameters are set to their default values. In the construction of the manifold graphs of DRCC and DNMTF, the numbers of nearest neighbors for the data graph and the feature graph are set as 11 and the weight mode is fixed as binary values. The algorithmic parameters of DRCC are tuned as λ = 1 and μ = 1. The regularization parameters of DNMTF are set as λ = 200 and μ = 200. The regularization parameters of PNMT are fixed as α = 1, β = 1 and γ = 1. The regularization weights of SPNMT are tuned as α = 1 and β = 1. Finally, the maximum number of iterations for DRCC, DNMTF, PNMT and SPNMT is set as 30.

4.3. Result and analysis

In this subsection, the compared performance of the different clustering algorithms is presented. The clustering accuracy and normalized mutual information are listed in Tables 2 and 3, respectively. The proposed two co-clustering algorithms PNMT and SPNMT are compared with three single-way clustering algorithms, K-means, FCM and SNMF, and two co-clustering algorithms, DRCC and DNMTF. From these two tables, we have the following observations. On one hand, co-clustering achieves better performance than the compared state-of-the-art single-way clustering algorithms on most tested datasets. However, single-way clustering also shows advantages over co-clustering in some cases, which may be explained by the fact that the class label information for features is unavailable and the number of clusters for features is regarded as a latent factor. Due to the uncertainty of label information for features, we set the number of clusters for features as the number of clusters for samples, which may deteriorate the clustering performance. On the other hand, the proposed algorithms demonstrate superiority over the compared state-of-the-art methods. In particular, the advantage of the proposed algorithm PNMT is relatively considerable on the ISOLET dataset.

The runtimes of the different clustering methods are shown in Fig. 1. From the figure, we have the following observations. Different clustering methods exhibit varying efficiency as the numbers of samples and features increase. For example, the proposed method SPNMT runs fast on datasets with fewer samples and more features (e.g. TOX), while the runtime of PNMT is positively related to both the sample number and the feature number.

Fig. 1. Runtime of different clustering methods in all tested datasets.

Table 2
Clustering accuracy (ACC% ± std%) of different clustering algorithms on different datasets. The best results are highlighted in bold (the higher the better).

Dataset   COIL         ISOLET       BASEHOCK     CNAE         PCMAC        TOX
PNMT      58.5 ± 3.8   61.2 ± 2.1   55.4 ± 0.3   51.9 ± 7.9   57.0 ± 4.5   46.1 ± 1.9
SPNMT     56.8 ± 5.5   55.8 ± 3.2   57.0 ± 3.6   41.4 ± 6.8   54.1 ± 2.6   42.4 ± 2.8
K-means   54.6 ± 4.7   53.5 ± 2.5   50.2 ± 0.1   21.0 ± 6.6   50.6 ± 0.1   40.1 ± 2.3
FCM       6.6 ± 3.8    9.8 ± 2.5    52.4 ± 3.4   40.9 ± 5.9   51.1 ± 0.1   37.5 ± 6.5
SNMF      47.0 ± 2.8   44.0 ± 2.0   55.4 ± 3.3   53.1 ± 2.7   51.6 ± 0.7   45.6 ± 2.9
DRCC      53.6 ± 4.1   52.5 ± 3.3   50.7 ± 0.7   33.7 ± 7.2   50.4 ± 0.2   40.9 ± 5.2
DNMTF     39.9 ± 3.9   29.6 ± 7.5   51.4 ± 0.9   46.7 ± 5.6   51.4 ± 1.3   43.3 ± 5.8

Table 3
Normalized mutual information (NMI% ± std%) of different clustering algorithms on different datasets. The best results are highlighted in bold (the higher the better).

Dataset   COIL         ISOLET       BASEHOCK     CNAE         PCMAC        TOX
PNMT      75.3 ± 1.8   77.1 ± 1.0   1.5 ± 1.0    51.2 ± 1.5   2.1 ± 1.2    24.9 ± 3.9
SPNMT     74.0 ± 3.0   75.4 ± 2.2   2.0 ± 1.2    39.4 ± 0.6   0.8 ± 0.4    24.4 ± 5.5
K-means   73.3 ± 2.3   71.6 ± 1.3   1.4 ± 0.8    22.8 ± 1.2   1.0 ± 0.1    22.5 ± 3.0
FCM       41.3 ± 4.2   40.5 ± 1.0   0.5 ± 0.3    43.2 ± 4.5   0.5 ± 0.1    11.7 ± 5.2
SNMF      60.0 ± 1.4   60.4 ± 1.4   1.5 ± 0.2    50.2 ± 1.5   0.3 ± 0.1    21.6 ± 2.7
DRCC      74.5 ± 2.1   75.2 ± 1.8   1.2 ± 0.9    33.2 ± 5.1   0.5 ± 0.2    12.4 ± 5.3
DNMTF     56.9 ± 3.3   70.0 ± 2.4   1.0 ± 0.1    50.8 ± 4.5   0.6 ± 0.2    23.0 ± 4.2

Fig. 2. The clustering accuracy of the proposed algorithm PNMT w.r.t. parameters α and β while fixing γ = 1. Here, α and β are varied with a grid-search strategy in {10^{-4}, 10^{-3}, ..., 10^{4}}.

4.4. Parameter sensitivity

In this subsection, the parameter sensitivity of the proposed algorithms is analyzed. Due to limited space, we report only the clustering accuracy of the proposed algorithm PNMT with respect to the regularization parameters α and β while keeping γ = 1, as shown in Fig. 2. We use a grid-search strategy to tune the values of α and β, ranging in {10^{-4}, 10^{-3}, ..., 10^{4}}. From the figure, it is observed that appropriate α and β result in better performance of the proposed algorithm, which also has a close relation to the initialization. Relatively small α and β may ignore the orthogonality of the clustering indicator matrices in co-clustering problems, while relatively large α and β may lead to distorted orthogonality and lower the clustering performance. In fact, excessive α and β may cause the proposed algorithm to produce some empty clusters, which deteriorates the clustering performance, as illustrated evidently in (b) and (d) of Fig. 2. Generally, regularization parameters α and β ranging in a certain region can help the proposed algorithm achieve favorable clustering performance.

5. Conclusion and further work

In this paper, two efficient co-clustering algorithms via nonnegative matrix tri-factorization were proposed. The proposed algorithms address more general co-clustering problems, where the near orthogonality of the clustering matrices is guaranteed. To handle the high-order orthogonality constraints, we incorporated an effective technique and proposed penalized nonnegative matrix tri-factorization. Simultaneously, we presented symmetric penalized nonnegative matrix tri-factorization to deal with co-clustering problems when the input data are abstracted as a sample similarity matrix, where a direct technique was proposed to address high-order constraints, not limited to orthogonality constraints. Finally, comprehensive experiments on six real-world datasets demonstrate the superiority of the proposed algorithms. Using the method with embedded penalty terms, we will explore more efficient algorithms for orthogonal nonnegative matrix factorization problems in our future work.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grant No. 61502104.


Appendix A. Proof of Theorem 1

Proof. First of all, we will prove that $Z(F, F')$ in Eq. (23) is an auxiliary function of $J(F)$ in Eq. (22). It is observed that

$$J(F) = -\mathrm{Tr}(X^TFSG^T) + \frac{1}{2}\mathrm{Tr}(GS^TF^TFSG^T) + \frac{\alpha}{2}\mathrm{Tr}(F\Lambda F^T) = -\mathrm{Tr}(SG^TX^TF) + \frac{1}{2}\mathrm{Tr}(F^TFSG^TGS^T) + \frac{\alpha}{2}\mathrm{Tr}(\Lambda F^TF). \tag{50}$$

In order to obtain an upper bound of $-\mathrm{Tr}(SG^TX^TF)$, we introduce the following inequality: for any $z > 0$,

$$z \ge 1 + \log(z). \tag{51}$$

Therefore,

$$\mathrm{Tr}(SG^TX^TF) = \sum_{j=1}^{c_1}\sum_{i=1}^{d}(XGS^T)_{ij}F_{ij} = \sum_{j=1}^{c_1}\sum_{i=1}^{d}(XGS^T)_{ij}F'_{ij}\frac{F_{ij}}{F'_{ij}} \ge \sum_{j=1}^{c_1}\sum_{i=1}^{d}(XGS^T)_{ij}F'_{ij}\left(1 + \log\frac{F_{ij}}{F'_{ij}}\right). \tag{52}$$

Using Lemma 2, we can obtain

$$\mathrm{Tr}(F^TFSG^TGS^T) \le \sum_{j=1}^{c_1}\sum_{i=1}^{d}\frac{(F'SG^TGS^T)_{ij}F_{ij}^2}{F'_{ij}} \tag{53}$$

$$\mathrm{Tr}(\Lambda F^TF) \le \sum_{j=1}^{c_1}\sum_{i=1}^{d}\frac{(F'\Lambda)_{ij}F_{ij}^2}{F'_{ij}}. \tag{54}$$

Therefore,

$$J(F) \le -\sum_{j=1}^{c_1}\sum_{i=1}^{d}(XGS^T)_{ij}F'_{ij}\left(1 + \log\frac{F_{ij}}{F'_{ij}}\right) + \frac{1}{2}\sum_{j=1}^{c_1}\sum_{i=1}^{d}\frac{(F'SG^TGS^T)_{ij}F_{ij}^2}{F'_{ij}} + \frac{\alpha}{2}\sum_{j=1}^{c_1}\sum_{i=1}^{d}\frac{(F'\Lambda)_{ij}F_{ij}^2}{F'_{ij}}, \tag{55}$$

which implies $J(F) \le Z(F, F')$ for any $F$ and $F'$. Furthermore, it is straightforward that $Z(F, F) = J(F)$ and that $Z(F, F')$ is continuous in $(F, F')$. Taking the derivative of $Z$ with respect to $F_{ij}$, we have

$$\frac{\partial Z(F, F')}{\partial F_{ij}} = -(XGS^T)_{ij}\frac{F'_{ij}}{F_{ij}} + (F'SG^TGS^T)_{ij}\frac{F_{ij}}{F'_{ij}} + \alpha(F'\Lambda)_{ij}\frac{F_{ij}}{F'_{ij}}. \tag{56}$$

The derivative of $J(F)$ is

$$\frac{dJ(F)}{dF} = -XGS^T + FSG^TGS^T + \alpha F\Lambda, \tag{57}$$

which implies $\frac{\partial Z(F, F')}{\partial F_{ij}}\big|_{F_{ij}=F'_{ij}} = \frac{\partial J(F)}{\partial F_{ij}}\big|_{F_{ij}=F'_{ij}}$, so $Z(F, F')$ is an auxiliary function of $J(F)$. The Hessian matrix of $Z(F, F')$ is

$$\frac{\partial^2 Z(F, F')}{\partial F_{ij}\partial F_{pq}} = \delta_{ip}\delta_{jq}\left(\frac{(XGS^T)_{ij}F'_{ij}}{F_{ij}^2} + \frac{(F'SG^TGS^T + \alpha F'\Lambda)_{ij}}{F'_{ij}}\right) \tag{58}$$

where $\delta_{xy} = 1$ if and only if $x = y$, and $\delta_{xy} = 0$ otherwise. It is observed that the Hessian matrix of $Z$ is a diagonal matrix with positive elements, which implies that $Z$ is a convex function and its global minimum can be obtained by setting $\frac{\partial Z(F, F')}{\partial F_{ij}} = 0$. Therefore, Eq. (56) implies Eq. (24).

Appendix B. Proof of Theorem 2

Proof. Analogous to the proof of Theorem 1, we first prove that $Z(S, S')$ in Eq. (26) is an auxiliary function of $J(S)$ in Eq. (25). It is noted that

$$J(S) = -\mathrm{Tr}(X^TFSG^T) + \frac{1}{2}\mathrm{Tr}(GS^TF^TFSG^T) + \frac{\gamma}{2}\mathrm{Tr}(S^TS) = -\mathrm{Tr}(G^TX^TFS) + \frac{1}{2}\mathrm{Tr}(S^TF^TFSG^TG) + \frac{\gamma}{2}\mathrm{Tr}(S^TS). \tag{59}$$

Using Lemma 2 and the inequality $z \ge 1 + \log(z)$ for any $z > 0$, we know

$$\mathrm{Tr}(G^TX^TFS) = \sum_{j=1}^{c_2}\sum_{i=1}^{c_1}(F^TXG)_{ij}S_{ij} \ge \sum_{j=1}^{c_2}\sum_{i=1}^{c_1}(F^TXG)_{ij}S'_{ij}\left(1 + \log\frac{S_{ij}}{S'_{ij}}\right) \tag{60}$$

$$\mathrm{Tr}(S^TF^TFSG^TG) \le \sum_{j=1}^{c_2}\sum_{i=1}^{c_1}\frac{(F^TFS'G^TG)_{ij}S_{ij}^2}{S'_{ij}}. \tag{61}$$

Hence

$$J(S) \le -\sum_{j=1}^{c_2}\sum_{i=1}^{c_1}(F^TXG)_{ij}S'_{ij}\left(1 + \log\frac{S_{ij}}{S'_{ij}}\right) + \frac{1}{2}\sum_{j=1}^{c_2}\sum_{i=1}^{c_1}\frac{(F^TFS'G^TG)_{ij}S_{ij}^2}{S'_{ij}} + \frac{\gamma}{2}\sum_{j=1}^{c_2}\sum_{i=1}^{c_1}S_{ij}^2, \tag{62}$$

which implies $J(S) \le Z(S, S')$. Moreover, $Z(S, S) = J(S)$ and $Z(S, S')$ is continuous in $(S, S')$. The derivative of $Z(S, S')$ with respect to $S_{ij}$ is

$$\frac{\partial Z(S, S')}{\partial S_{ij}} = -(F^TXG)_{ij}\frac{S'_{ij}}{S_{ij}} + (F^TFS'G^TG)_{ij}\frac{S_{ij}}{S'_{ij}} + \gamma S_{ij}. \tag{63}$$

The derivative of $J(S)$ is

$$\frac{dJ(S)}{dS} = -F^TXG + F^TFSG^TG + \gamma S, \tag{64}$$

which implies $\frac{\partial Z(S, S')}{\partial S_{ij}}\big|_{S_{ij}=S'_{ij}} = \frac{\partial J(S)}{\partial S_{ij}}\big|_{S_{ij}=S'_{ij}}$, so $Z(S, S')$ in Eq. (26) is an auxiliary function of $J(S)$ in Eq. (25). In order to search for the minimum of $Z(S, S')$, we compute the Hessian matrix of $Z(S, S')$, i.e.,

$$\frac{\partial^2 Z(S, S')}{\partial S_{ij}\partial S_{pq}} = \delta_{ip}\delta_{jq}\left(\frac{(F^TXG)_{ij}S'_{ij}}{S_{ij}^2} + \frac{(F^TFS'G^TG)_{ij}}{S'_{ij}} + \gamma\right). \tag{65}$$

It is evident that the Hessian matrix of $Z$ is a diagonal matrix with positive elements, which indicates that $Z$ is convex and its global minimum can be computed by setting $\frac{\partial Z(S, S')}{\partial S_{ij}} = 0$. Therefore, Eq. (63) implies Eq. (27).

Appendix C. Proof of Theorem 3

Proof. The proof is analogous to that of Theorem 1.

Appendix D. Proof of Theorem 5

Proof. We begin with the proof that $Z(G, G')$ in Eq. (42) is an auxiliary function of $O(G)$ in Eq. (41). We find an upper bound for each term of

$$O(G) = -\mathrm{Tr}(RGSG^T) + \frac{1}{2}\mathrm{Tr}(GS^TG^TGSG^T) + \alpha\mathrm{Tr}(G\Lambda G^T). \tag{66}$$

By the inequality $z \ge 1 + \log(z)$ for any $z > 0$, we know

$$\mathrm{Tr}(RGSG^T) = \sum_{l=1}^{n}\sum_{k=1}^{c}\sum_{j=1}^{c}\sum_{i=1}^{n}R_{il}S_{kj}G_{lk}G_{ij} \ge \sum_{l=1}^{n}\sum_{k=1}^{c}\sum_{j=1}^{c}\sum_{i=1}^{n}R_{il}S_{kj}G'_{lk}G'_{ij}\left(1 + \log\frac{G_{lk}G_{ij}}{G'_{lk}G'_{ij}}\right). \tag{67}$$

Using Lemma 3 and the inequality $ab \le \frac{a^2 + b^2}{2}$ for any $a \ge 0$ and $b \ge 0$, we have

$$\mathrm{Tr}(GS^TG^TGSG^T) \le \sum_{j=1}^{c}\sum_{i=1}^{n}\frac{(G'SG'^TG'S)_{ij}G_{ij}^4}{G'^3_{ij}} \tag{68}$$

$$\mathrm{Tr}(G\Lambda G^T) = \mathrm{Tr}(\Lambda G^TG) \le \sum_{j=1}^{c}\sum_{i=1}^{n}\frac{(G'\Lambda)_{ij}G_{ij}^2}{G'_{ij}} \le \frac{1}{2}\sum_{j=1}^{c}\sum_{i=1}^{n}\frac{(G'\Lambda)_{ij}(G_{ij}^4 + G'^4_{ij})}{G'^3_{ij}}. \tag{69}$$

Collecting all the above bounds, it is proved that

$$Z(G, G') = -\sum_{l=1}^{n}\sum_{k=1}^{c}\sum_{j=1}^{c}\sum_{i=1}^{n}R_{il}S_{kj}G'_{lk}G'_{ij}\left(1 + \log\frac{G_{lk}G_{ij}}{G'_{lk}G'_{ij}}\right) + \frac{1}{2}\sum_{j=1}^{c}\sum_{i=1}^{n}\frac{(G'SG'^TG'S)_{ij}G_{ij}^4}{G'^3_{ij}} + \frac{\alpha}{2}\sum_{j=1}^{c}\sum_{i=1}^{n}\frac{(G'\Lambda)_{ij}(G_{ij}^4 + G'^4_{ij})}{G'^3_{ij}} \tag{70}$$

is a tight upper bound of $O(G)$. The derivative of $Z(G, G')$ with respect to $G_{ij}$ is

$$\frac{\partial Z(G, G')}{\partial G_{ij}} = -2(RG'S)_{ij}\frac{G'_{ij}}{G_{ij}} + 2(G'SG'^TG'S)_{ij}\frac{G_{ij}^3}{G'^3_{ij}} + 2\alpha(G'\Lambda)_{ij}\frac{G_{ij}^3}{G'^3_{ij}}. \tag{71}$$

Taking the derivative of $O(G)$, we have

$$\frac{dO(G)}{dG} = -2RGS + 2GSG^TGS + 2\alpha G\Lambda, \tag{72}$$

which implies $\frac{\partial Z(G, G')}{\partial G_{ij}}\big|_{G_{ij}=G'_{ij}} = \frac{\partial O(G)}{\partial G_{ij}}\big|_{G_{ij}=G'_{ij}}$, so $Z(G, G')$ is an auxiliary function of $O(G)$. Next, we will find the local optimal value of $O(G)$. The Hessian matrix of $Z(G, G')$ is

$$\frac{\partial^2 Z(G, G')}{\partial G_{ij}\partial G_{pq}} = \delta_{ip}\delta_{jq}\left(2(RG'S)_{ij}\frac{G'_{ij}}{G_{ij}^2} + 6(G'SG'^TG'S)_{ij}\frac{G_{ij}^2}{G'^3_{ij}} + 6\alpha(G'\Lambda)_{ij}\frac{G_{ij}^2}{G'^3_{ij}}\right). \tag{73}$$

Because the Hessian matrix of $Z$ is a diagonal matrix with positive elements, $Z$ is convex in $G$ and its global minimum is obtained by setting $\frac{\partial Z(G, G')}{\partial G_{ij}} = 0$, which implies

$$G_{ij} = \arg\min_{G_{ij}} Z(G, G') = G'_{ij}\left[\frac{(RG'S)_{ij}}{(G'SG'^TG'S + \alpha G'\Lambda)_{ij}}\right]^{\frac{1}{4}}. \tag{74}$$

Appendix E. Proof of Theorem 6

Proof. The proof is analogous to that of Theorem 2.



References

Bao, B.-K., Min, W., Lu, K., & Xu, C. (2013). Social event detection with robust high-order co-clustering. In International conference on multimedia retrieval. ACM.
Beauchemin, M. (2015). A density-based similarity matrix construction for spectral clustering. Neurocomputing, 151, 835–844.
Brameier, M., & Wiuf, C. (2007). Co-clustering and visualization of gene expression data and gene ontology terms for saccharomyces cerevisiae using self-organizing maps. Journal of Biomedical Informatics, 40, 160–173.
Buono, N. D., & Pio, G. (2015). Non-negative matrix tri-factorization for co-clustering: An analysis of the block matrix. Information Sciences, 301, 13–26.
Chen, Y., Dong, M., & Wan, W. (2009). Image co-clustering with multi-modality features and user feedbacks. In ACM international conference on multimedia (pp. 689–692).
Dhillon, I. S., Guan, Y., & Kulis, B. (2004). Kernel k-means, spectral clustering and normalized cuts. In ACM SIGKDD conference on knowledge discovery and data mining (pp. 551–556).
Ding, C., He, X., & Simon, H. D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. In SIAM international conference on data mining (pp. 606–610).
Ding, C., Li, T., & Jordan, M. I. (2010). Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 45–55.
Ding, C., Li, T., Peng, W., & Park, H. (2006). Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 126–135).
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. Wiley-Interscience.
Dunn, J. (1973). A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3, 32–57.
Elhamifar, E., & Vidal, R. (2013). Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 2765–2781.
George, T., & Merugu, S. (2005). A scalable collaborative filtering framework based on co-clustering. In IEEE international conference on data mining (pp. 625–628).
Giannakidou, E., Koutsonikola, V. A., Vakali, A., & Kompatsiaris, Y. (2008). Co-clustering tags and social data sources. In International conference on web-age information management (pp. 317–324).
Gray, R. M. (2011). Entropy and information theory. Springer.
Gu, Q., & Zhou, J. (2009). Co-clustering on manifolds. In Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 359–368).
Hanmandlu, M., Verma, O. P., Susan, S., & Madasu, V. (2013). Color segmentation by fuzzy co-clustering of chrominance color features. Neurocomputing, 120, 235–249.
He, X., Cai, D., Shao, Y., Bao, H., & Han, J. (2011). Laplacian regularized gaussian mixture model for data clustering. IEEE Transactions on Knowledge and Data Engineering, 23, 1406–1418.
Huang, G., Zhang, J., Song, S., & Chen, Z. (2015a). Maximin separation probability clustering. In Association for the advancement of artificial intelligence (pp. 2680–2686).
Huang, S., Wang, H., Li, D., Yang, Y., & Li, T. (2015b). Spectral co-clustering ensemble. Knowledge-Based Systems, 84, 46–55.
Izquierdo-Verdiguier, E., Jenssen, R., Gmez-Chova, L., & Camps-Valls, G. (2015). Spectral clustering with the probabilistic cluster kernel. Neurocomputing, 149, 1299–1304.
Khoshneshin, M., & Street, W. N. (2010). Incremental collaborative filtering via evolutionary co-clustering. In Proceedings of the fourth ACM conference on recommender systems (pp. 325–328).
Kim, J., & Park, H. (2008). Sparse nonnegative matrix factorization for clustering. In Georgia Institute of Technology technical report GT-CSE-08-01 (pp. 1–15).
Kondo, Y., Matias, S.-B., & Zamar, R. (2012). A robust and sparse k-means clustering algorithm. Statistics - Machine Learning, 1, 1–20.
Konstas, I., Stathopoulos, V., & Jose, J. M. (2009). On social networks and collaborative recommendation. In Proceedings of the 32nd international ACM SIGIR conference on research and development (pp. 195–202).
Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in neural information processing systems (pp. 556–562).
Li, T., & Ding, C. (2006). The relationships among various nonnegative matrix factorization methods for clustering. In IEEE international conference on data mining (pp. 362–371).
Liu, W., Li, S., Lin, X., Wu, Y., & Ji, R. (2015). Spectral-spatial co-clustering of hyperspectral image data based on bipartite graph. Multimedia Systems, DOI 10.1007/s00530-015-0450-0, 1–12.
Liu, Y., Gu, Q., Hou, J. P., Han, J., & Ma, J. (2014). A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression. BMC Bioinformatics, 15, 93–106.
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28, 129–137.
Lovasz, L., & Plummer, M. D. (2009). Matching theory. American Mathematical Society.
Luo, J., Liu, B., Cao, B., & Wang, S. (2016). Identifying miRNA-mRNA regulatory modules based on overlapping neighborhood expansion from multiple types of genomic data. In International conference on intelligent computing (pp. 234–246). Springer International Publishing.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems (pp. 849–856).
Papalexakis, E. E., Sidiropoulos, N. D., & Bro, R. (2012). From k-means to higher-way co-clustering: Multilinear decomposition with sparse latent factors. IEEE Transactions on Signal Processing, 61, 493–506.
Pedrycz, W. (1993). Fuzzy control and fuzzy systems. Taunton, UK: Research Studies Press.
Pedrycz, W., & Waletzky, J. (1997). Fuzzy clustering with partial supervision. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 27, 787–795.
Peng, X., Zhang, L., & Yi, Z. (2013). Scalable sparse subspace clustering. In IEEE conference on computer vision and pattern recognition (pp. 430–437).
Pio, G., Ceci, M., Loglisci, C., D'Elia, D., & Malerba, D. (2012). Hierarchical and overlapping co-clustering of mRNA: miRNA interactions. In European conference on artificial intelligence (pp. 654–659).
Pompili, F., Gillis, N., Absil, P.-A., & Glineur, F. (2014). Two algorithms for orthogonal nonnegative matrix factorization with application to clustering. Neurocomputing, 141, 15–25.
Prasad, M., Lin, Y., Lin, C., Er, M., & Prasad, O. (2015). A new data-driven neural fuzzy system with collaborative fuzzy clustering mechanism. Neurocomputing, 167, 558–568.
Qiu, X., Qiu, Y., Feng, G., & Li, P. (2015). A sparse fuzzy c-means algorithm based on sparse clustering framework. Neurocomputing, 157, 290–295.
Razaviyayn, M., Hong, M., & Luo, Z.-Q. (2013). A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23, 1126–1153.
Rege, M., Dong, M., & Fotouhi, F. (2006). Co-clustering documents and words using bipartite isoperimetric graph partitioning. In IEEE international conference on data mining (pp. 532–541).
Shang, F., Jiao, L., & Wang, F. (2012). Graph dual regularization non-negative matrix factorization for co-clustering. Pattern Recognition, 45, 2237–2250.
Vitaladevuni, S. N., & Basri, R. (2010). Co-clustering of image segments using convex optimization applied to EM neuronal reconstruction. In IEEE conference on computer vision and pattern recognition (pp. 2203–2210).
Wang, H., Nie, F., Huang, H., & Ding, C. (2011). Nonnegative matrix tri-factorization based high-order co-clustering and its fast implementation. In 2011 IEEE 11th international conference on data mining (pp. 774–783).
Wang, S., Pedrycz, W., Zhu, Q., & Zhu, W. (2015). Subspace learning for unsupervised feature selection via matrix factorization. Pattern Recognition, 48, 10–19.
Wang, S., & Zhu, W. (2016). Sparse graph embedding unsupervised feature selection. IEEE Transactions on Systems, Man, and Cybernetics: Systems.
Yu, S., Tranchevent, L., Liu, X., & Glanzel, W. (2012). Optimized data fusion for kernel k-means clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34, 1031–1039.