
Low-rank Nonnegative Matrix Factorization on Stiefel Manifold

Ping He^a, Xiaohua Xu^a,*, Jie Ding^b, Baichuan Fan^a

a Department of Computer Science, Yangzhou University, Yangzhou 225009, China
b China Institute of FTZ Supply Chain, Shanghai Maritime University, Shanghai 201306, China
* Corresponding author: Department of Computer Science, Yangzhou University, Yangzhou 225009, e-mail: [email protected]

Abstract

Low rank is an important but ill-posed problem in the development of nonnegative matrix factorization (NMF) algorithms because the essential information is often encoded in a low-rank intrinsic data matrix, whereas noise and outliers are contained in a residue matrix. Most existing NMF approaches achieve low rank by directly specifying the dimensions of the factor matrices. A few others impose the low rank constraint on the factor matrix and use the alternating direction method of multipliers to solve the optimization problem. In contrast to previous approaches, this paper proposes a novel method for low-rank nonnegative matrix factorization on a Stiefel manifold (LNMFS), which utilizes the low rank structure of intrinsic data and transforms it into a Frobenius norm of the latent factors. To obtain orthogonal factors as distinct patterns, we further impose orthogonality constraints by assuming that the basis matrix lies on a Stiefel manifold. In addition, to improve the robustness of the data in a manifold structure, we incorporate the graph smoothness of the coefficient matrix. Finally, we develop an efficient alternative iterative algorithm to solve the optimization problem and provide proof of its convergence. Extensive experiments on real-world datasets demonstrate the superiority of the proposed method compared with other representative NMF-based algorithms.

Keywords: Nonnegative matrix factorization, Low rank, Orthogonal constraints, Stiefel manifold, Graph smoothness

1. Introduction

With the rapid development of data collection techniques, high-dimensional data, such as online documents, electronic books, photographs, video series in social media, and health data, are becoming increasingly common. Accordingly, generalized computational methods are required for extracting intrinsic low-dimensional information. Although great success has been achieved by various dimension-reduction methods [10], such as singular value decomposition (SVD) [21], principal component analysis [19], and independent component analysis [9], few of them can provide a straightforward physical meaning for the decomposition results.

In recent years, nonnegative matrix factorization (NMF) [42, 49] has attracted increasing attention owing to the straightforward interpretability of its nonnegative results [22]. Technically, NMF seeks to identify a product of two nonnegative matrices that provides a good approximation to the original matrix. Therefore, the main difference of NMF from other dimension reduction methods (e.g., SVD) is that NMF allows only non-subtractive combinations of nonnegative components. This nonnegativity constraint eventually leads to the parts-based representation of NMF.

Various methods have been proposed to extend the standard NMF by considering different aspects in additional constraints. One of the most widely used constrained NMF methods is sparse NMF (SNMF) [14, 17, 31]. SNMF introduces a weighted penalty term to enhance sparsity in the factor matrices, leading to a local-based representation. In addition to sparsity constraints, another method for achieving sparsity is orthogonal NMF (ONMF) [12, 23, 25]. ONMF removes the redundancy in data representation by imposing orthogonality constraints on the factor matrices because, under the condition of non-negativity, orthogonality constraints necessarily result in sparseness. As the orthogonal constraint set is a submanifold of the Stiefel manifold, Choi et al. [8, 16] perform ONMF by applying the natural gradient method on a Stiefel submanifold to achieve lower computational cost. To promote exact orthogonality, Zhang et al. [45] exploit the sparsity of non-negative orthogonal solutions to divide the global problem into a series of local optimizations. As semi-NMF becomes popular, Zhang et al. [48] remove the non-negativity constraints on the basis matrix and develop an efficient orthogonal semi-NMF algorithm over a Stiefel manifold. To maintain the local geometrical structure in the low-dimensional data representation, manifold NMF models [39, 7, 47, 44] incorporate the manifold structure into the NMF framework. A representative manifold NMF model is graph regularized NMF (GNMF) [7], which penalizes the graph smoothness term in the standard NMF model. In addition to these non-probabilistic NMF models, there are also probabilistic NMF approaches, which are based on, for example, variational Bayesian inference [4] and Gibbs sampling. However, as real-world data are often contaminated, different algorithms have been developed to address the problem of noise and outliers. Robust NMF (RNMF) [46][50] assumes that the corrupted entries are sparse and enforces a sparsity constraint on the residual matrix. Based on RNMF, Févotte et al. [15] adopt a group-sparse outlier residual term to handle possible nonlinear effects (GRNMF). In recent years, the low rank constraint has received growing attention in the development of robust matrix decomposition methods [41] because the essential data information is often encoded in the low-rank intrinsic data matrix, whereas unexpected noise and outliers are encoded in a residue matrix. However, existing low-rank NMF models [28] impose the low rank constraint on the outer product of the basis matrix rather than the intrinsic data, which is less intuitive. Additionally, they adopt the alternating direction method of multipliers (ADMM) to solve the objective function, which introduces higher computational cost into the optimization of NMF.

In this study, we propose a novel low-rank NMF learning method on the Stiefel manifold (LNMFS) that imposes three additional constraints on the standard NMF. First, LNMFS adds a low rank constraint on the intrinsic data by penalizing its nuclear norm. To simplify the optimization, we transform the nuclear norm of the intrinsic data matrix into a convex Frobenius norm of the latent factors based on a well-known theorem [33]. Second, to obtain distinct patterns for easier interpretation, LNMFS assumes that the basis matrix lies on a Stiefel manifold, implying that different factors are orthogonal to each other. Third, to improve the robustness for data in a manifold structure, LNMFS incorporates the graph smoothness constraint on the coefficient matrix. The proposed method is formulated as an optimization problem solved by an iterative multiplicative algorithm. We analyze the time complexity of the proposed optimization algorithm and prove the theoretical convergence of the optimization scheme. Extensive experimental results on clustering problems demonstrate the effectiveness of the proposed approach in comparison with nine representative NMF-based algorithms. The superior performance of LNMFS on noisy datasets indicates its noise robustness. In addition, the empirical convergence performance of LNMFS is demonstrated by its rapid convergence on the experimental datasets. Furthermore, we discuss the influence of the regularization parameters as well as the embedding dimension on the clustering performance of LNMFS.

There are several aspects of the proposed approach that should be mentioned:

1. In real-world applications, the collected data are often contaminated by noise and outliers. To reduce the interference of noise and discover the underlying factors, LNMFS utilizes the low-rank structure of the intrinsic data matrix and transforms it into a Frobenius-norm regularization of the latent factors.

2. LNMFS incorporates the low rank constraint with the graph smoothness constraint on the Stiefel manifold to utilize the local geometry of the data manifold and reduce the redundancy of the data representation.

3. LNMFS is formulated as an optimization problem with a well-defined objective function. Two efficient multiplicative updating rules are provided for the optimization of the basis and coefficient matrices. Both a theoretical proof and an empirical analysis are provided to demonstrate the convergence of the optimization scheme.

4. Experimental results demonstrate that LNMFS outperforms nine representative NMF-based algorithms in clustering tasks on various real-world datasets. The comparison on noisy datasets further demonstrates the noise robustness of LNMFS.

The remainder of this paper is organized as follows. Section 2 provides a brief introduction to the Stiefel manifold and representative NMF algorithms. Section 3 elaborates the formulation of the proposed LNMFS framework and provides an efficient algorithm for solving the optimization problem. Section 4 presents extensive experimental results for LNMFS in comparison with nine representative NMF-based methods. Finally, Section 5 concludes this paper and provides possible directions for future work.

2. Preliminaries

Herein, we provide preliminaries regarding optimization on the Stiefel manifold. For clarity, Table 1 summarizes important notations used in this paper.

Table 1: Summary of notations

Notation   Description
X          observed data matrix
Y          intrinsic data matrix without noise in X
E          error matrix introduced by noise and outliers
U          basis matrix
V          coefficient matrix
L          Laplacian matrix of graph
W          similarity matrix
Q          objective function
x_i        i-th column vector in X
v_i        i-th column vector in V
v^i        i-th row vector in V
n          number of data points in X
m          dimension of data points in X
r          rank of Y
d          embedding dimension of matrix factorization
λ          weight of the low-rank constraint
β          weight of the graph smoothness constraint

(The descriptions of λ and β follow Eq. (19) and Section 4.5: λ weights the low-rank term and β weights the graph smoothness term.)

2.1. Stiefel Manifold

Let S denote the set of orthonormal matrices,

    S \triangleq \{ U \in \mathbb{R}^{m \times d} : U^T U = I_d \}    (1)

where I_d \in \mathbb{R}^{d \times d} is the identity matrix. The set S is called the (orthogonal) Stiefel manifold, whose dimension is dim(S) = md - \frac{1}{2} d(d+1). The tangent space to S at the point U is obtained by differentiating U^T U = I, yielding

    T_U S = \{ Z \in \mathbb{R}^{m \times d} : U^T Z + Z^T U = 0 \}    (2)

where U^T Z \in \mathbb{R}^{d \times d} is a skew-symmetric matrix. On the tangent space T_U S for any U \in S, we introduce the canonical inner product [13] as its metric,

    \langle Z_1, Z_2 \rangle_c = tr(Z_1^T (I - \tfrac{1}{2} U U^T) Z_2)    (3)

whereas the Euclidean inner product is

    \langle Z_1, Z_2 \rangle_e = tr(Z_1^T Z_2)    (4)

Then, we can view S as a sub-manifold of a Riemannian manifold.

Let Q be an objective function defined on the Stiefel manifold. We use \nabla_U Q and \tilde{\nabla}_U Q to represent the gradient of Q at U in the Euclidean space and on the Stiefel manifold, respectively. Let \langle \nabla_U Q, Z \rangle_e = \langle \tilde{\nabla}_U Q, Z \rangle_c; this is equivalent to

    tr((\nabla_U Q)^T Z) = tr((\tilde{\nabla}_U Q)^T (I - \tfrac{1}{2} U U^T) Z)    (5)

By solving Eq. (5), we can obtain the representation of \tilde{\nabla}_U Q:

    \tilde{\nabla}_U Q = \nabla_U Q - U (\nabla_U Q)^T U    (6)
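To make the projection in Eq. (6) concrete, the following NumPy sketch (our own illustration, not part of the original paper; the function name and example data are assumptions) computes a Euclidean gradient and its natural-gradient counterpart on the Stiefel manifold, and checks that the result lies in the tangent space of Eq. (2).

```python
import numpy as np

def natural_gradient_stiefel(grad_U, U):
    """Stiefel-manifold gradient of Eq. (6): grad~ = grad - U grad^T U."""
    return grad_U - U @ grad_U.T @ U

# Usage with the NMF fitting term 0.5*||X - U V^T||_F^2,
# whose Euclidean gradient w.r.t. U is U V^T V - X V (cf. Eq. (20)).
rng = np.random.default_rng(0)
X = rng.random((50, 30))
U = np.linalg.qr(rng.random((50, 5)))[0]   # a point satisfying U^T U = I
V = rng.random((30, 5))

euc_grad = U @ V.T @ V - X @ V
nat_grad = natural_gradient_stiefel(euc_grad, U)

# Tangent-space condition of Eq. (2): U^T Z + Z^T U = 0.
print(np.allclose(U.T @ nat_grad + nat_grad.T @ U, 0.0, atol=1e-10))
```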

2.2. Orthogonal NMF

Given a nonnegative observed data matrix X = [x_1, x_2, ..., x_n] \in \mathbb{R}_+^{m \times n}, where x_i \in \mathbb{R}_+^m represents the i-th observed data point, NMF seeks to factorize X into two nonnegative matrices, that is, U \in \mathbb{R}_+^{m \times d} and V \in \mathbb{R}_+^{n \times d} (d \ll \min(m, n)):

    X \approx U V^T    (7)

Owing to the non-negativity constraints, NMF provides a parts-based data representation, which is easy to interpret in many application scenarios. In Eq. (7), U is regarded as the basis matrix, whereas V is regarded as the coefficient matrix and is used for tasks such as clustering. Both U and V can be obtained by minimizing a cost function that quantifies the approximation error to X. The most commonly used cost function is the square of the Euclidean distance:

    \min_{U \ge 0, V \ge 0} \| X - U V^T \|_F^2    (8)

We note that the cost function (8) is non-convex in both variables, but it is convex in U or V alone. Therefore, the objective function can be solved by an iterative algorithm [22] with the following multiplicative rules:

    U \leftarrow U \odot \frac{X V}{U V^T V}, \quad V \leftarrow V \odot \frac{X^T U}{V U^T U}    (9)

where \odot and the fraction bar denote elementwise matrix product and division, respectively.

Based on standard NMF, Ding et al. [12] proposed ONMF models, which impose orthogonality constraints on the basis and coefficient matrices. We refer to the orthogonal NMF with orthogonality constraints on U as ONMF-U, that is,

    \min_{U \ge 0, V \ge 0, \, U^T U = I} \| X - U V^T \|_F^2    (10)

and to the orthogonal NMF with orthogonality constraints on V as ONMF-V. There are two major approaches for solving the objective functions of ONMF. One is proposed by Ding et al. and considers additive orthogonality constraints based on the Lagrangian multipliers method. The other is developed by Choi [8] and directly uses the true gradient (natural gradient) on the Stiefel manifold. Comparatively, the latter has lower computational complexity [39] than Ding's method. For ONMF-U on the Stiefel manifold, the multiplicative update rules are as follows:

    U \leftarrow U \odot \frac{X V}{U V^T X^T U}, \quad V \leftarrow V \odot \frac{X^T U}{V U^T U}    (11)
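As a concrete reference, here is a minimal NumPy sketch (our own illustration; the variable names and the small epsilon guard are assumptions, not from the paper) of the multiplicative updates in Eqs. (9) and (11).

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero (our own safeguard)

def nmf_step(X, U, V):
    """One multiplicative update of standard NMF, Eq. (9)."""
    U = U * (X @ V) / (U @ V.T @ V + EPS)
    V = V * (X.T @ U) / (V @ U.T @ U + EPS)
    return U, V

def onmf_u_step(X, U, V):
    """One multiplicative update of ONMF-U on the Stiefel manifold, Eq. (11)."""
    U = U * (X @ V) / (U @ V.T @ X.T @ U + EPS)
    V = V * (X.T @ U) / (V @ U.T @ U + EPS)
    return U, V

# Usage: a few iterations on random nonnegative data.
rng = np.random.default_rng(0)
X = rng.random((100, 60))
U, V = rng.random((100, 5)), rng.random((60, 5))
for _ in range(50):
    U, V = nmf_step(X, U, V)
print(np.linalg.norm(X - U @ V.T) / np.linalg.norm(X))
```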

3. Proposed Method

In this section, we propose the novel LNMFS algorithm, which incorporates the low-rank constraint on the intrinsic data, the orthogonality constraint on the basis matrix, and the graph smoothness constraint on the coefficient matrix into the standard NMF framework.

3.1. Objective Function

Without loss of generality, the observed data matrix X \in \mathbb{R}_+^{m \times n} can be considered the sum of two parts: the intrinsic data matrix Y \in \mathbb{R}_+^{m \times n} and the error matrix E \in \mathbb{R}^{m \times n}.

We first assume that the intrinsic data lie on a low-dimensional manifold, which implies that Y has a low rank. Low rank is an important but ill-posed problem in the development of NMF-based algorithms because the essential information is often encoded in the low-rank intrinsic data matrix, whereas noise and outliers are contained in the error matrix. Formally, the initial objective function is

    \min \ \tfrac{1}{2} \| X - Y \|_F^2 + \lambda \| Y \|_*
    s.t. \ Y = U V^T, \ U \ge 0, \ V \ge 0    (12)

In Eq. (12), \| Y \|_* is the nuclear norm of Y, which plays the role of regularizing the rank of Y. By replacing Y with U V^T, we can rewrite Eq. (12) as

    \min_{U \ge 0, V \ge 0} \ \tfrac{1}{2} \| X - U V^T \|_F^2 + \lambda \| U V^T \|_*    (13)

According to the matrix Hölder inequality,

    \| U V^T \|_* \le \| U \|_F \| V \|_F    (14)

As \| U \|_F \| V \|_F \le \tfrac{1}{2} (\| U \|_F^2 + \| V \|_F^2), we have

    \| U V^T \|_* \le \tfrac{1}{2} (\| U \|_F^2 + \| V \|_F^2)    (15)

Therefore, the nuclear-norm regularized objective function (Eq. (13)) is equivalent to the following non-convex problem:

    \min_{U \ge 0, V \ge 0} \ \tfrac{1}{2} \| X - U V^T \|_F^2 + \tfrac{\lambda}{2} (\| U \|_F^2 + \| V \|_F^2)    (16)
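The bound in Eq. (15) is easy to verify numerically; the following short NumPy snippet (our own sanity check, not from the paper) compares the nuclear norm of UV^T with the Frobenius-norm surrogate for random nonnegative factors.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.random((80, 4))
V = rng.random((60, 4))

nuclear = np.linalg.norm(U @ V.T, ord="nuc")           # ||U V^T||_*
surrogate = 0.5 * (np.linalg.norm(U, "fro") ** 2 +
                   np.linalg.norm(V, "fro") ** 2)       # bound of Eq. (15)
print(nuclear <= surrogate + 1e-9)                      # True
```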

Secondly, in several real-world applications, non-overlapping bases are often preferable in parts-based interpretation (e.g., image segmentation). To obtain distinct parts-based patterns, we suppose that the basis matrix lies on the Stiefel manifold, leading to

    \min_{U \ge 0, V \ge 0, \, U^T U = I} \ \tfrac{1}{2} \| X - U V^T \|_F^2 + \tfrac{\lambda}{2} (\| U \|_F^2 + \| V \|_F^2)    (17)

We note that the Frobenius norm of U is fixed under the orthogonality constraint (\| U \|_F^2 = tr(U^T U) = d). Therefore, Eq. (17) can be written in a more concise form as follows:

    \min_{U \ge 0, V \ge 0, \, U^T U = I} \ \tfrac{1}{2} \| X - U V^T \|_F^2 + \tfrac{\lambda}{2} \| V \|_F^2    (18)

Finally, to preserve the local geometric structure of the underlying data manifold, LNMFS also incorporates the graph smoothness regularization term into the objective function:

    \min_{U \ge 0, V \ge 0, \, U^T U = I} \ Q = \tfrac{1}{2} \| X - U V^T \|_F^2 + \tfrac{\lambda}{2} \| V \|_F^2 + \tfrac{\beta}{2} tr(V^T L V)    (19)

In Eq. (19), tr(V^T L V) = \tfrac{1}{2} \sum_{i,j} w_{ij} \| v^i - v^j \|^2 is the graph smoothness regularizer, L = D - W is the Laplacian matrix, v^i represents the coefficient vector of the i-th data point, W = [w_{ij}]_{n \times n} is the pairwise similarity matrix, and D = diag(W1) is the diagonal degree matrix. The parameters λ and β control the weights of the low rank and graph smoothness regularization terms, respectively.

12

189

190

191

3.2. Optimization We minimize the objective function Eq. (19) by first computing the derivative of Q with respect to U and V in the Euclidean space. ∇U Q = UVT V − XV

(20)

∇V Q = (VUT U + λV + βDV) − (XT U + βWV)

(21)

192

193

194

e U Q) As U lies on a Stiefel manifold, we compute the natural gradient of U (∇ on the Stiefel manifold.

e U Q = ∇U Q − U(∇U Q)T U ∇

= (UVT V − XV) − U(UVT V − XV)T U

(22)

= UVT XT U − XV 195

Using the KKT conditions and KKT complementary slackness, that is, e U Q = 0, U ∇

196

V ∇V Q = 0

(23)

we obtain the multiplicative update rules of U and V as follows: XVt Ut Vt T XT Ut

(24)

XT Ut + βWVt Vt Ut T Ut + λVt + βDVt

(25)

Ut+1 ← Ut 197

Vt+1 ← Vt

13

Algorithm 1 Algorithm of LNMFS Input: Nonnegative observed data matrix X Pairwise similarity matrix W Parameters λ and β Output: Nonnegative factor matrices U and V 1: t ← 0, randomly initialize U0 on Stiefel manifold and V0 2: Compute D ← diag(W1) 3: repeat XVt 4: Ut+1 ← Ut t t T T t UV X U Vt+1 ← Vt

5:

XT Ut + βWVt Vt Ut T Ut + λVt + βDVt

t←t+1 7: until the convergence criterion is satisfied 8: return Ut+1 , Vt+1 6:

198

Algorithm 1 summarizes the procedures of the proposed LNMFS algorithm.

199

Theorem 1 provides the convergence proof of the multiplicative updating

200

rules.

201

3.3. Convergence

202

To prove the convergence of the two updating rules in Eqs. (24) and

203

(25), we start from the definition of the auxiliary function G. Let QU (V)

204

represent the objective function Q(U, V) when U is fixed.

205

˜ is an auxiliary function to QU (V) if and only if the Definition 1. G(V, V)

206

207

208

˜ ≥ QU (V), G(V, V) = QU (V) are satisfied. conditions G(V, V)

˜ is an upper bound of QU (V) and is As the auxiliary function G(V, V)

˜ the optimization of QU (V) can be replaced by the iterative tight for V = V, 14

209

update V(t+1) = arg min G(V, V(t) ) V

210

(26)

because by combining Eq. (26) with Definition 1, we have

QU (V(t+1) ) ≤ G(V(t+1) , V(t) ) ≤ G(V(t) , V(t) )

(27)

= QU (V(t) ) 211

Lemma 1. Given a semi-definite similarity matrix W, ˜ T WV) ˜ − 2tr(VT WV) ˜ −tr(VT WV) ≤ tr(V

212

(28)

Proof. As W is semi-definite, ˜ T W(V − V)) ˜ 0 ≤ tr((V − V)

(29)

˜ T WV) ˜ − 2tr(VT WV) ˜ = tr(VT WV) + tr(V 213

we obtain ˜ T WV) ˜ − 2tr(VT WV) ˜ −tr(VT WV) ≤ tr(V

(30)

214

215

Theorem 1. The objective function in Eq. (19) is nonincreasing under the

216

updating rules in Eqs. (24) and (25).

15

217

Proof. For all k ∈ {1, 2, . . . , d}, we define uik v˜jk η˜ijk = P , uil v˜jl

(31)

l

218

which satisfies

P

k

η˜ijk = 1. QU (V) can be decomposed into two parts, QU (V) = EU (V) + RU (V)

219

where EU (V) =

220

RU (V) =

(32)

1

X − UVT 2 F 2

(33)

β λ kVk2F + tr(VT LV) 2 2

(34)

221

EU (V) evaluates the approximation error, whereas RU (V) summarizes the

222

regularization terms related to V. As EU (V) is convex, by using Jensen’s

223

inequality, we have

EU (V) =

1 2

=

1 2

P P (xij − uik vjk )2 i,j

k

P P u vjk 2 (xij − η˜ijk ηik ) ˜ijk i,j

k

PPP 1



2

i

j

k

η˜ijk (xij −

(35)

uik vjk 2 ) η˜ijk

˜ , GE (V, V) 224

˜ = Therefore, GE (V, V)

225

tion to EU (V).

1 2

P P P i

j

k

η˜ijk (xij −

16

uik vjk 2 ) η˜ijk

is the auxiliary func-

226

Moreover, based on Lemma 1, whose element-wise formulation is −

227

X i,j,k

X

wij vik vjk ≤

X

wij v˜ik v˜jk − 2

i,j,k

wij v˜ik vjk

(36)

i,j,k

we have RU (V) =

λ 2

=

λ 2



λ 2

P j,k

P j,k

P j,k

−β

2 vjk +

β 4

2 vjk +

β 2

2 + vjk

β 2

P

P

i,j,k

P j,k

P j,k

wij (vik − vjk )2

2 dj vjk −

β 2

2 + dj vjk

β 2

P

wij vik vjk

i,j,k

P

wij v˜ik v˜jk

(37)

i,j,k

wij v˜ik vjk

i,j,k

˜ , GR (V, V) 228

229

230

˜ = Hence, GR (V, V)

λ 2

P j,k

β 2

2 vjk +

P j,k

2 dj vjk +

β 2

P

i,j,k

the auxiliary function to RU (V).

wij v˜ik v˜jk −β

P

wij v˜ik vjk is

i,j,k

˜ as an auxiliary function to QU (V). We now define G(V, V) ˜ = GE (V, V) ˜ + GR (V, V) ˜ G(V, V) =

1 2

P

i,j,k

+

β 2

−β

η˜ijk (xij −

P j,k

2 dj vjk +

P

uik vjk 2 ) η˜ijk β 2

wij v˜ik vjk

i,j,k

≥ QU (V)

17

P

i,j,k

+

λ 2

P j,k

wij v˜ik v˜jk

2 vjk

(38)

231

232

We note that G(V, V) = QU (V) is trivially met. By taking the derivative

˜ with respect to vjk , we have of G(V, V)

˜ = P(P uil v˜jl )uik vjk − P xij uik + λvjk ∇vjk G(V, V) v˜jk i

i

l



P j,k

=

dj vjk − β

v

−β

P

j,k

v

jk dj v˜jk v˜jk

wij v˜ik

i

h

i h i  T  V ˜ TU − X U jk VU ˜ V jk

˜ jk + λ[V]

h i

V ˜ V jk

h i ˜ − β WV

jk

h i h i V ˜ + β DV ˜ V jk

(40) jk

jk

˜ be equal to 0, we have the update rule of vjk : By letting ∇vjk G(V, V)

vjk = v˜jk h

236

P

Eq. (39) can be rewritten in the following matrix form. ˜ = ∇vjk G(V, V)

235

(39)

i

l

+β + λ˜ vjk v˜jk jk

234

wij v˜ik

i

PP P v ( uil v˜jl )uik v˜jk − xij uik jk i

233

P

h

˜ XT U + βWV

i

jk

˜ T U + λV ˜ + βDV ˜ VU

i

(41) jk

which is equivalent to the update rule of V in matrix form in Eq. (25). ˜ is an auxiliary function, QU (V) is nonincreasing under this As G(V, V) 18

237

238

update rule. The update rule for U can be obtained similarly by using the natural

239

gradient in the Stiefel manifold.

240
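Theorem 1 can also be checked empirically: with the V-update of Eq. (25), the objective Q of Eq. (19) should never increase. The snippet below is our own sanity check (the data, the positive semi-definite choice of W matching Lemma 1's assumption, and the tolerance are illustrative, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d, lam, beta = 40, 30, 3, 0.1, 0.1
X = rng.random((m, n))
A = rng.random((n, 5))
W = A @ A.T                                       # symmetric PSD similarity (Lemma 1)
D = np.diag(W.sum(axis=1))
L = D - W
U = np.abs(np.linalg.qr(rng.random((m, d)))[0])   # any fixed nonnegative basis
V = rng.random((n, d))
eps = 1e-12

def Q(U, V):
    return (0.5 * np.linalg.norm(X - U @ V.T, "fro") ** 2
            + 0.5 * lam * np.linalg.norm(V, "fro") ** 2
            + 0.5 * beta * np.trace(V.T @ L @ V))

prev = Q(U, V)
for _ in range(200):
    # V-update of Eq. (25); Theorem 1 guarantees Q does not increase here.
    V = V * (X.T @ U + beta * (W @ V)) / (V @ U.T @ U + lam * V + beta * (D @ V) + eps)
    cur = Q(U, V)
    assert cur <= prev + 1e-8
    prev = cur
print("monotone decrease verified")
```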

3.4. Time Complexity

The time complexity of the update rule for U (Eq. (24)) should be analyzed in two parts: the numerator and the denominator. As n and m represent the number and dimension of the data points in X, respectively, the time complexity of computing the numerator (XV) is O(mnd), and the complexity of the denominator (UV^T X^T U) is O(mnd). Therefore, if the updates converge after τ iterations, the computational time cost of updating U is O(mndτ).

The time complexity of the update rule for V (Eq. (25)) is analyzed similarly. The computation of the numerator (X^T U + βWV) costs O(mnd) + O(n^2 d) = O(nd · max(n, m)). The computation of the denominator (VU^T U + λV + βDV) costs O(mnd) + O(nd) = O(mnd). Therefore, if the updates converge after τ iterations, the time complexity of updating V is O(ndτ · max(n, m)).

To summarize, the overall time complexity of LNMFS is O(mndτ) + O(ndτ · max(n, m)) = O(ndτ · max(n, m)), where τ is the number of iterations before convergence.

4. Experiments

In this section, we evaluate the performance of the proposed LNMFS algorithm in comparison with nine widely used nonnegative matrix decomposition methods on real-world datasets. First, we compare the clustering performance and runtime performance of the ten NMF-based methods, including the proposed LNMFS, on various clustering tasks. Subsequently, we compare the clustering performance of the ten NMF-based algorithms on increasingly noisy datasets. Then, we illustrate the empirical convergence of LNMFS on the experimental datasets, and we discuss the effect of the regularization parameters on the clustering performance of LNMFS. Finally, we demonstrate the relationship between the embedding dimension and the clustering performance of LNMFS in comparison with other NMF-based algorithms.

Table 2: Summary of experimental datasets

Dataset            # Instances   # Dimensions   # Clusters
Yale               165           4096           15
Wine               178           13             3
Pendigits          477           16             3
Breast             683           9              2
COIL20             1440          1024           20
Optical digits     5620          16             10
TDT2               10212         36771          95
NC vs AD           120           90             2
NC vs MCI          120           90             2
NC vs MCI vs AD    180           90             3

269

4.1. Experimental Settings

We evaluate the performance of LNMFS in comparison with nine representative nonnegative matrix decomposition methods: standard NMF (NMF), orthogonal NMF (ONMF-U and ONMF-V), SNMF, GNMF, RNMF, GRNMF, probabilistic NMF (PNMF), and the greedy orthogonal pivoting algorithm for NMF (GOPA). Among the compared algorithms, ONMF-U imposes orthogonality constraints on the basis matrix to produce distinctive parts. ONMF-V imposes orthogonality constraints on the coefficient matrix to produce distinctive cluster-indicating vectors. SNMF regularizes the sparsity of the coefficient matrix to reduce redundancy. GNMF incorporates a manifold regularization term to maintain the local geometrical structure. RNMF enforces a sparsity constraint on the residual matrix to improve model robustness. GRNMF adopts a group-sparse outlier residual term to handle possible nonlinear effects. PNMF introduces variational Bayesian inference to achieve deterministic convergence to the solution of NMF instead of relying on random draws [4]. GOPA fully exploits the sparsity of non-negative orthogonal solutions to promote exact orthogonality [45].

286

287

288

289

290

291

We compare the ten NMF-based algorithms on the real-world experimental datasets summarized in Table 2. • Yale is a face dataset composed of 165 grayscale images from 15 individuals. • Wine is a chemical analysis dataset that quantifies 13 constituents found in three types of wines.

292

• Pendigits is a pen-based recognition dataset of handwritten digits from

293

UCI. We adopt a subset of the original dataset, which is composed of

294

3, 8 and 9 three digits.

295

• Breast dataset comes from Breast Cancer Wisconsin Dataset in UCI.

296

The missing data and the first attribute (ID number) of the dataset

297

are removed. 21

298

299

300

301

• COIL20 consists of 1440 32 × 32 grayscale images for 20 objects at different angles in a 360◦ rotation.

• Optical digits is an optical recognition dataset of handwritten digits from 43 people.

302

• TDT2 is a collection of text documents that include six months of

303

material drawn on a daily basis from six English news sources. The

304

documents appearing in two or more categories are removed.

305

Furthermore, we apply the compared algorithms to the Alzheimer’s disease

306

image dataset (Figure 1) from Alzheimer’s Disease Neuroimaging Initiative

307

(ADNI) database. Alzheimer’s disease (AD) [40, 18] is an irreversible, pro-

308

gressive neurological brain disorder that slowly destroys brain cells, causes

309

memory and thinking skill losses, and ultimately loss of the ability to carry

310

out simple tasks. In this experiment, we adopt 180 magnetic resonance imag-

311

ing (MRI) images, including 60 normal control (NC), 60 mild cognitive im-

312

pairment (MCI, a prodromal stage of AD) and 60 AD patients. From each

313

MRI scan, we extract 90 cortical thickness features because cortical thickness

314

is known to be associated with memory performance, cognitive decline and

315

progression to dementia [5], and it has been a widely used predictor for AD

316

diagnosis [29].

317

In the preprocessing of the Alzheimer’s disease image dataset, we first per-

318

form anterior commissure–posterior commissure correction and skull-striping.

319

Then, we segment the brain images into gray matter, white matter, and CSF

320

three parts, where the gray matter is further divided into 90 regions of interest

321

(ROIs) after registration with the automated anatomical labeling template 22

(a) Axial view

(b) Sagittal view

(c) Coronal View

Figure 1: Illustration of a sample MRI scan from the Alzheimer’s disease image dataset

322

[36]. Subsequently, we calculate the cortical thickness of each ROI between

323

the inner and outer surface as the features of MRI images [11]. The entire

324

process is performed in CAT12 (Computational Anatomy Toolbox). With

325

the extracted cortical thickness features, we perform two binary clustering

326

tasks (NC vs AD, NC vs MCI) and one three-way clustering task (AD vs

327

MCI vs NC) on the Alzheimer’s disease dataset (Table 2).

328

For implementation, we adopt the standard NMF with multiplicative up-

329

date strategy in Matlab1 [3]. For ONMF-U, ONMF-V, SNMF, RNMF and

330

PNMF, we use their implementations in the comprehensive package NM-

331

Flibrary2 . The codes of GNMF3 , GRNMF4 and GOPA5 are provided by

332

their authors. As both GNMF and LNMFS are graph-dependent, we pro-

333

vide them with the same pairwise similarity matrix (W) on each dataset. For

334

the text document datasets, we compute their normalized tf-idf features and 1

https://ww2.mathworks.cn/help/stats/nnmf.html https://github.com/hiroyuki-kasai/NMFLibrary 3 http://www.cad.zju.edu.cn/home/dengcai/Data/GNMF.html 4 https://www.irit.fr/˜Cedric.Fevotte/publications.html 5 https://github.com/kzhang980/ORNMF 2

23

335

use the cosine function to calculate the document similarities. For the non-text datasets, we construct 0/1 neighbor graphs, where w_ij = 1 if the j-th data point is one of the k nearest neighbors of the i-th data point or vice versa, and w_ij = 0 otherwise.
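A sketch of this graph construction is shown below (our own illustration; the symmetrization rule and the helper name are assumptions consistent with the description above): a symmetric 0/1 k-nearest-neighbor similarity matrix W and the Laplacian L = D - W used in Eq. (19).

```python
import numpy as np

def knn_graph(X_cols, k):
    """Symmetric 0/1 k-NN similarity matrix for data given column-wise (m x n),
    using Euclidean distance; w_ij = 1 if j is among the k nearest neighbors
    of i, or vice versa."""
    Z = X_cols.T                                    # one row per data point
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    n = Z.shape[0]
    W = np.zeros((n, n))
    nearest = np.argsort(dist, axis=1)[:, :k]
    for i in range(n):
        W[i, nearest[i]] = 1.0
    return np.maximum(W, W.T)                       # "or vice versa"

# Laplacian and degree matrix used by the graph smoothness term in Eq. (19).
rng = np.random.default_rng(0)
X = rng.random((16, 200))
W = knn_graph(X, k=10)
D = np.diag(W.sum(axis=1))
L = D - W
```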

For each dataset, we follow the existing NMF models [45, 48, 46] and set the embedding dimension of the matrix decomposition to be the same as the number of clusters, so that each dimension of the latent feature space corresponds to a distinct cluster. As LNMFS assumes that the basis matrix lies on a Stiefel manifold, we initialize its U as a random orthogonal matrix. The initialization strategy is as follows. First, we generate a random nonnegative matrix in the range (0, 1). Then, we keep only the maximal value in each row unchanged and set the other elements to 0. Finally, we perform normalization to ensure that U_0^T U_0 = I. The initial V of LNMFS is randomly initialized in the range (0, 1). For the other NMF-based algorithms, their initial U and V are randomly initialized following their default settings. For each algorithm, the maximal number of iterations is set to 500, and the convergence tolerance to 10^-5. The regularization parameters of each algorithm, if any, are optimized over the range 10^(-5:0.5:5). For the graph-dependent algorithms (i.e., GNMF and LNMFS), we construct a 3-NN graph on the Yale and TDT2 datasets, and a 10-NN graph for the other datasets using the (standardized) Euclidean distance. After the nonnegative matrix decomposition, the k-means algorithm is applied to the coefficient matrix (V) to produce the clustering results.
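The initialization of U described above can be sketched as follows (our own illustration; the column-normalization step and the handling of an empty column are assumptions about details the paper leaves implicit).

```python
import numpy as np

def init_stiefel_nonnegative(m, d, rng):
    """Random nonnegative U0 with (approximately) U0^T U0 = I: keep one entry
    per row of a random (0, 1) matrix, zero the rest, then normalize columns."""
    R = rng.random((m, d))
    U0 = np.zeros((m, d))
    rows = np.arange(m)
    idx = R.argmax(axis=1)
    U0[rows, idx] = R[rows, idx]
    norms = np.linalg.norm(U0, axis=0)
    norms[norms == 0] = 1.0            # guard against a column with no entries
    return U0 / norms

rng = np.random.default_rng(0)
U0 = init_stiefel_nonnegative(165, 15, rng)
print(np.allclose(U0.T @ U0, np.eye(15)))   # True when every column is nonempty
```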

Table 3: Comparison of clustering performance in terms of Rand index Algorithm

358

Yale

Wine

Pendigits

Breast

COIL20

NMF

0.9010±0.0091

0.6499±0.0127

0.7215±0.0726

0.9082±0.0213

0.9495±0.0051

ONMF-U

0.8948±0.0109

0.6751±0.0248

0.6984±0.0814

0.9113±0.0089

0.9514±0.0073

ONMF-V

0.7514±0.0808

0.6111±0.0590

0.7130±0.0742

0.7309±0.0817

0.9201±0.0335

SNMF

0.8892±0.0090

0.7034±0.0431

0.8054±0.0058

0.8085±0.0461

0.9199±0.0075

GNMF

0.9165±0.0073

0.7045±0.0135

0.8985±0.0658

0.9404±0.0006

0.9664±0.0032

RNMF

0.9158±0.0062

0.6667±0.0206

0.8300±0.0450

0.9253±0.0050

0.9566±0.0031

GRNMF

0.9117±0.0062

0.6990±0.0207

0.8352±0.0298

0.9267±0.0000

0.9555±0.0024

PNMF

0.8957±0.0070

0.7050±0.0132

0.7854±0.0000

0.9146±0.0024

0.9448±0.0073

GOPA

0.8922±0.0307

0.6922±0.0468

0.7145±0.0607

0.9307±0.0027

0.9512±0.0045

LNMFS

0.9248±0.0027

0.8971±0.0028

0.9300±0.0028

0.9421±0.0013

0.9733±0.0017

Algorithm

Optical Digits

TDT2

NC vs AD

NC vs MCI

NC vs MCI vs AD

NMF

0.9208±0.0081

0.9102±0.0014

0.4975±0.0026

0.4987±0.0059

0.5588±0.0052

ONMF-U

0.9111±0.0083

0.9101±0.0011

0.5849±0.0071

0.5453±0.0165

0.5777±0.0096

ONMF-V

0.9031±0.0273

0.8301±0.0266

0.5046±0.0087

0.5016±0.0068

0.5526±0.0057

SNMF

0.9359±0.0051



0.5314±0.0162

0.5179±0.0218

0.5529±0.0528

GNMF

0.9573±0.0067

0.9484±0.0047

0.6304±0.0000

0.5765±0.0001

0.5983±0.0007

RNMF

0.9283±0.0064



0.5399±0.0239

0.5182±0.0139

0.5616±0.0133

GRNMF

0.9232±0.0069



0.6052±0.0017

0.5464±0.0039

0.5907±0.0077

PNMF

0.9112±0.0066



0.4993±0.0000

0.4989±0.0001

0.5535±0.0051

GOPA

0.9003±0.0095



0.5905±0.0000

0.5566±0.0076

0.5787±0.0028

LNMFS

0.9649±0.0077

0.9546±0.0030

0.6392±0.0000

0.5905±0.0000

0.6019±0.0011

4.2. Performance Comparison

359

We evaluate the clustering performance of ten NMF-based algorithms us-

360

ing four criteria: Rand index [2], normalized mutual information (NMI) [34],

361

F-measure [30], and purity [27] (Tables 3–6). Among the four criteria, Rand

362

index focuses on the ratio of “good choices”, NMI is defined using mutual

363

information, F-measure considers both precision and recall rate, and purity

364

measures the extent to which clusters contain a single class. Twenty test runs

365

were conducted to compute the average clustering performance and standard

366

deviation for each criterion. “—” indicates failure to produce clustering re25

Table 4: Comparison of clustering performance in terms of NMI Algorithm

Yale

Wine

Pendigits

Breast

COIL20

NMF

0.5422±0.0409

0.4206±0.0259

0.5094±0.0712

0.7115±0.0439

0.7392±0.0157

ONMF-U

0.5522±0.0262

0.3404±0.0737

0.5003±0.0803

0.7166±0.0208

0.7553±0.0198

ONMF-V

0.4493±0.0454

0.2807±0.0523

0.5035±0.0599

0.3802±0.1349

0.6856±0.0443

SNMF

0.4620±0.0191

0.4239±0.0152

0.6120±0.0111

0.5228±0.0918

0.6894±0.0112

GNMF

0.6049±0.0204

0.4491±0.0100

0.7625±0.0951

0.7925±0.0016

0.8439±0.0044

RNMF

0.5934±0.0170

0.4268±0.0237

0.6341±0.0674

0.7514±0.0124

0.7666±0.0073

GRNMF

0.5840±0.0214

0.4058±0.0175

0.6314±0.0523

0.7546±0.0000

0.7476±0.0102

PNMF

0.4776±0.0203

0.4421±0.0202

0.5519±0.0000

0.7252±0.0056

0.7138±0.0109

GOPA

0.5343±0.0420

0.3812±0.1122

0.4702±0.0815

0.7649±0.0068

0.7556±0.0176

LNMFS

0.6369±0.0142

0.7856±0.0036

0.8166±0.0100

0.7965±0.0039

0.8588±0.0064

Algorithm

Optical Digits

TDT2

NC vs AD

NC vs MCI

NC vs MCI vs AD

NMF

0.6605±0.0228

0.6808±0.0063

0.0026±0.0039

0.0043±0.0087

0.0224±0.0065

ONMF-U

0.6323±0.0269

0.6367±0.0097

0.1334±0.0107

0.0726±0.0250

0.0779±0.0128

ONMF-V

0.6773±0.0248

0.6165±0.0184

0.0135±0.0140

0.0086±0.0102

0.0127±0.0049

SNMF

0.7263±0.0123



0.0567±0.0224

0.0405±0.0309

0.0531±0.0217

GNMF

0.8612±0.0091

0.7993±0.0080

0.2069±0.0000

0.1196±0.0002

0.0944±0.0009

RNMF

0.6862±0.0175



0.0705±0.0382

0.0371±0.0247

0.0382±0.0089

GRNMF

0.6619±0.0195



0.1643±0.0010

0.0751±0.0074

0.0879±0.0131

PNMF

0.6430±0.0132



0.0055±0.0000

0.0045±0.0001

0.0094±0.0025

GOPA

0.5561±0.0362



0.1423±0.0000

0.0896±0.0117

0.0804±0.0040

LNMFS

0.8770±0.0155

0.8112±0.0062

0.2229±0.0000

0.1403±0.0000

0.1005±0.0026

367

sults in reasonable time. Moreover, we report the average runtime and actual

368

number of iterations for each algorithm on the experimental datasets (Table

369

7). All the experiments are performed in MATLAB R2018a on a PC with

370

3.7 GHz Intel Core i7 and 16 GB RAM.

371

In Tables 3-6, it can be seen that the proposed LNMFS exhibits the

372

best clustering performance. This is particularly obvious on the Wine and

373

Pendigits datasets, where LNMFS achieves an improvement of Rand index

374

by 0.2 and 0.1, respectively, over the best among the other algorithms. This

375

demonstrates the effectiveness of LNMFS in leveraging the power of low-rank

26

Table 5: Comparison of clustering performance in terms of F-measure Algorithm

Yale

Wine

Pendigits

Breast

COIL20

NMF

0.3178±0.0403

0.5838±0.0228

0.6583±0.0598

0.9169±0.0180

0.5473±0.0306

ONMF-U

0.3343±0.0230

0.5662±0.0280

0.6354±0.0759

0.9193±0.0078

0.5732±0.0390

ONMF-V

0.1984±0.0377

0.5002±0.0407

0.6376±0.0528

0.7715±0.0756

0.4388±0.0823

SNMF

0.2477±0.0186

0.5784±0.0124

0.7212±0.0073

0.8354±0.0380

0.4355±0.0214

GNMF

0.4001±0.0284

0.6028±0.0161

0.8546±0.0824

0.9451±0.0005

0.6937±0.0196

RNMF

0.3852±0.0230

0.6141±0.0188

0.7559±0.0528

0.9319±0.0044

0.5946±0.0163

GRNMF

0.3680±0.0260

0.5929±0.0169

0.7593±0.0415

0.9331±0.0000

0.5815±0.0178

PNMF

0.2611±0.0180

0.5915±0.0175

0.6914±0.0000

0.9224±0.0021

0.5148±0.0267

GOPA

0.3170±0.0468

0.5649±0.0531

0.6225±0.0653

0.9367±0.0024

0.5646±0.0307

LNMFS

0.4447±0.0209

0.8459±0.0041

0.8951±0.0042

0.9467±0.0012

0.7406±0.0149

Algorithm

Optical Digits

TDT2

NC vs AD

NC vs MCI

NC vs MCI vs AD

NMF

0.6153±0.0344

0.2069±0.0123

0.4997±0.0113

0.5040±0.0082

0.3466±0.0064

ONMF-U

0.5769±0.0330

0.1820±0.0086

0.5833±0.0069

0.5422±0.0173

0.3919±0.0178

ONMF-V

0.5861±0.0578

0.2268±0.0269

0.5060±0.0123

0.5015±0.0091

0.3415±0.0108

SNMF

0.6889±0.0210



0.5483±0.0273

0.5705±0.0529

0.3891±0.0331

GNMF

0.8015±0.0246

0.6801±0.0454

0.6298±0.0000

0.5739±0.0002

0.3952±0.0006

RNMF

0.6514±0.0283



0.5504±0.0244

0.5331±0.0239

0.3585±0.0101

GRNMF

0.6244±0.0282



0.6331±0.0004

0.5451±0.0077

0.3907±0.0126

PNMF

0.5780±0.0220



0.5106±0.0000

0.4992±0.0022

0.3361±0.0067

GOPA

0.5149±0.0432



0.5891±0.0000

0.5541±0.0080

0.3793±0.0030

LNMFS

0.8330±0.0329

0.7246±0.0239

0.6394±0.0000

0.5872±0.0000

0.4027±0.0001

376

and graph smoothness on the latent Stiefel manifold. Moreover, the stan-

377

dard deviation of LNMFS is relatively low compared with that of the other

378

algorithms on most datasets. This suggests the stability of the proposed

379

approach. Among the other nine NMF-based algorithms, GNMF exhibits

380

the second best clustering performance on most experimental datasets. To-

381

gether with LNMFS, it illustrates the importance of preserving the local

382

manifold data structure in clustering tasks. In addition to GNMF, ONMF-U

383

and GOPA are another two NMF-based algorithms closely related to LN-

384

MFS because they both impose orthogonal constraints on U. Compared

27

Table 6: Comparison of clustering performance in terms of purity Algorithm

Yale

Wine

Pendigits

Breast

COIL20

NMF

0.5858±0.0343

0.8124±0.0466

0.8136±0.0325

0.9517±0.0124

0.6993±0.0225

ONMF-U

0.6518±0.0227

0.7222±0.0319

0.8274±0.0366

0.9535±0.0049

0.7310±0.0168

ONMF-V

0.6685±0.0549

0.6744±0.0954

0.8049±0.0248

0.7890±0.1218

0.6856±0.0214

SNMF

0.5148±0.0257

0.7202±0.0615

0.8420±0.0054

0.8920±0.0274

0.7188±0.0145

GNMF

0.6412±0.0289

0.7514±0.0446

0.9244±0.0398

0.9693±0.0003

0.8372±0.0134

RNMF

0.6394±0.0184

0.8562±0.0529

0.8651±0.0309

0.9612±0.0027

0.7231±0.0235

GRNMF

0.6267±0.0264

0.7522±0.0605

0.8625±0.0280

0.9619±0.0000

0.7108±0.0167

PNMF

0.5300±0.0235

0.7270±0.0547

0.8197±0.0000

0.9553±0.0013

0.6806±0.0180

GOPA

0.6121±0.0284

0.6983±0.0405

0.7845±0.0528

0.9641±0.0015

0.7174±0.0229

LNMFS

0.7151±0.0087

0.9149±0.0027

0.9439±0.0023

0.9702±0.0001

0.8452±0.0089

Algorithm

Optical Digits

TDT2

NC vs AD

NC vs MCI

NC vs MCI vs AD

NMF

0.7523±0.0266

0.4152±0.0156

0.5629±0.0539

0.5954±0.0364

0.4572±0.0184

ONMF-U

0.7227±0.0377

0.3664±0.0132

0.7100±0.0084

0.6546±0.0263

0.5414±0.0555

ONMF-V

0.7736±0.0194

0.5756±0.0390

0.5871±0.0313

0.5687±0.0302

0.4425±0.0382

SNMF

0.8123±0.0198



0.6762±0.0768

0.7409±0.1480

0.5664±0.1133

GNMF

0.9088±0.0156

0.7700±0.0384

0.7583±0.0000

0.7000±0.0002

0.5139±0.0085

RNMF

0.7845±0.0283



0.6629±0.0399

0.6500±0.0599

0.4758±0.0302

GRNMF

0.7551±0.0257



0.7329±0.0018

0.6538±0.0060

0.5233±0.0304

PNMF

0.7156±0.0213



0.6250±0.0000

0.5662±0.0122

0.4303±0.0217

GOPA

0.6590±0.0401



0.7167±0.0000

0.6733±0.0107

0.4764±0.0092

LNMFS

0.9207±0.0180

0.7985±0.0195

0.7667±0.0000

0.7417±0.0000

0.5722±0.0001

385

with standard NMF, ONMF-V and SNMF, they achieve superior clustering

386

results on most datasets. In contrast, ONMF-V (another orthogonal NMF

387

algorithm), has inferior clustering performance to that of ONMF-U and even

388

standard NMF on most datasets. PNMF has comparable clustering perfor-

389

mance with that of non-probabilistic standard NMF. Furthermore, the robust

390

NMF models (RNMF and GRNMF) rank approximately between GNMF

391

and ONMF-U. Their advantage lies in their model robustness by imposing

392

(group) sparsity constraints on the residue matrix. In fact, LNMFS also en-

393

hances the model robustness, but in a different manner, that is, by imposing

28

A

ORBsupmed. R ACG. R TPOsup. R

TPOsup. L

OLF. R

INS. L

DCG. R

DCG. L

STG. L

ORBsupmed. R

FFG. L

SMG. L

INS. L

DCG. L INS. L

ORBsupmed. R

LING. R

STG. L FFG. L

LING. R

SMG. L

ACG. R

SMG. L

TPOsup. L PCUN. L

Prefrontal

PCUN. L

PCUN. L

DCG. L

FFG. L

TPOsup. R

Frontal

TPOsup. L

TPOmid. L

TPOmid. L

B

Parietal

Temporal SFGmed. L ACG. L OLF. L ROL. L

ORBsup. R REC. R TPOmid. R ROL. R

ITG. L FFG. R

ACG. L

SMG. R ORBsup. R REC. R

SMG. R

SMG. R

SFGmed. L

INS. R

ROL. R

ROL. R INS. R

INS. R FFG. R

IOG. L

Occipital

SFGmed. L ACG. L ORBsup. R

FFG. R

IOG. L

ROL. L ITG. L

Subcortical

REC. R IOG. L

(I) Axial view

TPOmid. R

TPOmid. R

ITG. L

(II) Sagittal view

(III) Coronal view

Figure 2: (A) Significant brain ROIs in the first basis vector learnt by LNMFS from the Alzheimer’s disease dataset (NC vs AD). (B) Significant brain ROIs in the second basis vector learnt by LNMFS from the Alzheimer’s disease dataset (NC vs AD). The three different views (i.e., axial,Sagittal sagittal, brainView ROIs are depicted using Axial View View and coronal) ofCoronal BrainNet Viewer [43]. The size of each node (i.e., ROI) is proportional to its value in the basis vector. Different ROIs are rendered with different colors according to their anatomical locations, as suggested by [38].

394

the low-rank constraint on the intrinsic data matrix. This contributes to the

395

superior clustering performance of LNMFS as well. Finally, we notice that all

396

algorithms fail to produce as good clustering performance on the Alzheimer’s

397

disease dataset as on the other datasets. This may be because MRI images

398

contain a significant amount of noise, which may be introduced by the op-

399

erator, equipment and environment [26]. Nevertheless, LNMFS still exhibits

400

better clustering performance than the other NMF-based algorithms.

401

To explain the improved clustering performance of LNMFS, we take the

402

Alzheimer’s disease dataset (NC vs AD) as an example, and show its de-

29

403

composed basis matrix (U) as the extracted brain ROI features in Figure 2.

404

In Figure 2(A), ORBsupmed.R (right superior frontal gyrus, medial orbital)

405

and bilateral TPOsup, that is, TPOsup.L and TPOsup.R (temporal pole,

406

superior temporal gyrus), rank the top three significant brain ROIs in the

407

first basis vector of LNMFS. This agrees with the fact in neuroscience that

408

orbitofrontal cortex is involved in the cognitive process of decision-making,

409

and superior temporal gyrus is involved in the comprehension of language

410

and is among the earliest areas affected by Alzheimer’s disease [32]. There-

411

fore, these physically interpretable features extracted by the basis matrix

412

further improve the practicability of the proposed approach.

413

Regarding runtime performance (Table 7), NMF is the most efficient al-

414

gorithm among the ten NMF-based methods under comparison. LNMFS

415

costs less than 4 s on nine out of ten datasets, and it is the third most effi-

416

cient algorithm on the largest TDT2 dataset. However, half of the compared

417

algorithms, including SNMF, RNMF, GRNMF, PNMF and GOPA, fail to

418

produce any decomposition result on TDT2 dataset in reasonable time. Fur-

419

thermore, GOPA requires the smallest number of iterations (listed in the

420

parentheses), but it does not necessarily lead to the least runtime. Most of

421

the evaluated NMF-based algorithms (including LNMFS) reach the maxi-

422

mal number of iterations (500) before they satisfy the convergence tolerance

423

(10−5 ). However, this by no means indicates that they do not converge be-

424

cause the actual number of iterations depends on the magnitude of the spec-

425

ified convergence tolerance. Later (Section 4.4), we will see that LNMFS in

426

fact convergences quite rapidly.

30

Table 7: Comparison of averaged runtime performance (s). The numbers in the parentheses indicate the actual number of iterations. Algorithm

Yale

Wine

Pendigits

Breast

COIL20

NMF

3.02 (500)

0.02 (450)

0.02 (274)

0.01 (312)

4.18 (259)

ONMF-U

2.79 (500)

0.04 (500)

0.05 (500)

0.05 (500)

7.56 (500)

ONMF-V

3.20 (500)

0.04 (500)

0.05 (500)

0.04 (500)

8.06 (500)

SNMF

4.26 (500)

0.04 (500)

0.08 (500)

0.10 (500)

24.19 (500)

GNMF

0.57 (500)

0.02 (500)

0.04 (500)

0.06 (500)

3.42 (500)

RNMF

7.34 (500)

0.05 (500)

0.07 (500)

0.07 (500)

20.50 (500)

GRNMF

15.04 (500)

0.07 (500)

0.11 (499)

0.11 (160)

37.07 (500)

PNMF

162.54 (500)

0.25 (500)

0.46 (500)

0.34 (500)

533.12 (500)

GOPA

0.47 (89)

0.08 (18)

0.31 (31)

0.44 (23)

11.45 (55)

LNMFS

1.03 (500)

0.06 (500)

0.16 (500)

0.06 (386)

3.82 (500)

Optical Digits

TDT2

NC vs AD

Algorithm

427

NC vs MCI NC vs MCI vs AD

NMF

1.20 (485)

7.72 (2)

0.01 (242)

0.01 (236)

0.02 (242)

ONMF-U

1.25 (500)

2097.57 (500)

0.06 (500)

0.07 (500)

0.07 (500)

ONMF-V

1.14 (500)

2486.86 (500)

0.06 (500)

0.06 (500)

0.07 (500)

SNMF

32.96 (500)



0.07 (500)

0.07 (500)

0.09 (500)

GNMF

16.62 (500)

106.71 (500)

0.02 (500)

0.02 (500)

0.03 (500)

RNMF

9.57 (500)



0.09 (500)

0.09 (500)

0.11 (500)

GRNMF

6.77 (500)



0.06 (320)

0.07 (237)

0.18 (465)

PNMF

49.99 (500)



0.32 (500)

0.33 (500)

0.57 (500)

GOPA

121.90 (35)



0.03 (22)

0.03 (23)

0.07 (24)

LNMFS

2.04 (500)

876.71 (500)

0.06 (500)

0.06 (500)

0.10 (500)

4.3. Noise Robustness

428

In addition to the real-world noisy datasets such as Alzheimers’ disease

429

datasets (Section 4.2), herein, we evaluate LNMFS on Yale dataset with

430

added Salt & Pepper noise (Figure 3), in comparison with the other nine

431

NMF-based algorithms (Figure 4). Salt & Pepper noise is a common type of

432

corruption in images. Unfortunately, it is a challenging task to remove Salt &

433

Pepper noise in computer vision because it contaminates each pixel by zero

434

or the maximum pixel value, and its distribution violates the common noise 31

r = 0.05

r = 0.10

r = 0.15

r = 0.20

r = 0.25

r = 0.30

r = 0.35

r = 0.40

r = 0.45

r = 0.50

r = 0.05

r = 0.10

r = 0.15

r = 0.20

r = 0.25

r = 0.30

r = 0.35

r = 0.40

r = 0.45

r = 0.50

Figure 3: Example face images contaminated by Salt & Pepper noise with different noise ratios (from 5% to 50%).

(A) NMI

(B) Rand Index

(C) F-measure

(D) Purity

Figure 4: Clustering performance curves of ten NMF-based algorithms on Yale dataset with different noise ratios.

435

assumption in traditional learning methods. To demonstrate the robustness

436

of LNMFS, we vary the percentage of corrupted pixels from 5% to 50%. For

437

each case of additive Salt & Pepper noise, we conduct twenty test runs and

438

report the average clustering performance evaluated in terms of four different

439

criteria.

440

Figure 4 shows that LNMFS consistently exhibits superior performance

441

to that of the other algorithms despite the change of the noise ratios. This is

442

particularly obvious when the noise ratio is smaller than 0.2 or greater than

443

0.45. The two robust NMF models, that is, RNMF and GRNMF, surpris-

444

ingly exhibit comparable clustering performance with that of standard NMF.

445

This may be because their noise assumption violates the Salt & Pepper noise

446

distribution. ONMF-U exhibits considerably better clustering performance

32

447

than NMF and ONMF-V. Furthermore, GOPA has relatively stable clus-

448

tering performance when the noise ratio is smaller than or equal to 0.35,

449

but its performance starts to decline when the noise ratio grows larger than

450

0.4. Together with LNMFS, they demonstrate the effectiveness of orthogo-

451

nal constraints on U for handling Salt & Pepper noise. Finally, the Rand

452

index curve of ONMF-V is increasing as the noise ratio increases because its

453

clustering result becomes more balanced on noisier datasets.

454

4.4. Empirical Convergence

455

In Section 3.3, we theoretically proved the convergence of the multiplica-

456

tive updating rules of LNMFS. We will now investigate the empirical conver-

457

gence performance of LNMFS on the experimental datasets.

458

Figure 5 shows the convergence of the relative difference of U (Figure

459

5(A)) and V (Figure 5(B)), which are decomposed by LNMFS, as the number

460

of iterations increases. It can be seen that both U and V converge rapidly,

461

and U converges slightly faster than V. In the first 100 iterations, the relative

462

difference curves of U and V drop sharply, then decline slowly, and gradually

463

become stable. This demonstrates that LNMFS is a convergent algorithm.

464

LNMFS reaches the maximal number of iterations in the experiments (Table

465

7) because its maximal relative difference of U and V is close to but not

466

below the specified convergence tolerance (10−5 ) at the maximal number of

467

iterations.

468

4.5. Parameter Influence

469

The proposed LNMFS algorithm has two regularization parameters, the

470

weight of the low-rank regularization term λ and the weight of the graph 33

A

Yale

Wine

Pendigits

Breast

COIL20

Optical Digits

TDT2

NC vs AD

NC vs MCI

NC vs MCI vs AD

Yale

Wine

Pendigits

Breast

COIL20

Optical Digits

TDT2

NC vs AD

NC vs MCI

NC vs MCI vs AD

B

of LRNMF on the datasets.of(A) relative Figure 5: 5: Empirical Empiricalconvergence convergence of LNMFS. (A)experimental Relative difference U,The defined as ||Ut+1 Ut ||F t ||Ut+1 −Ut ||F t th di↵erence of U, defined as , with the increasing number of iterations, U , with increasing number iteration. t ||Ut ||F of iterations. U indicates U in the t ||U ||F

t+1

t

V ||F ||V relative −V ||Fdi↵erence of V, defined as ||V indicates U indifference the t thofiteration. (B)asThe , (B) Relative V, defined , with increasing number of iterations. ||Vt ||F ||Vt ||F t with the increasing number of iterations. V indicates V in the tth iteration. t+1

t

471 471

keep consistent with the previous setting, regularization smoothness regularization term β.experimental To discuss the effecteach of the regulariza-

472 472

parameter changesweover the range of 10[ 5:0.5:5]performance . tion parameters, consider the clustering of LNMFS under

474 474

In Fig. 6, we can thetypical same dataset, influence of the regdifferent settings of λ see andthat β onforfour datasetsthe (Figure 6). For consisularization parameters very similar patterns di↵erent evaluation tency with the previousshow experimental setting, eachacross regularization parameter

475 475

(−5:0.5:5) the influence of the regularization paramcriteria. over But for changes the di↵erent range 10datasets, .

473 473

34

A

B

C

D

(I) Wine

(II) Pendigits

(III) COIL20

(IV) NC vs AD

Figure 6: Influence of the regularization parameters (the low-rank regularization parameter λ and the graph smoothness parameter β) on the clustering performance of LNMFS in terms of four evaluation criteria. Each regularization parameter changes over the range 10(−5:0.5:5) . Parameter influence on (A) Rand index, (B) NMI, (C) F-measure and (D) purity.

476

In Fig. 6, it can be seen that for the same dataset, the influence of the

477

regularization parameters exhibit highly similar patterns in terms of different

478

evaluation criteria. However, for different datasets, the influence of the regu-

479

larization parameters varies. Despite the differences, the superior clustering

480

performance of LNMFS is achieved not only for a specific parameter setting

481

but also in a range of parameter settings. For instance, on the Wine dataset,

482

most of the parameter settings for LNMFS can lead to higher Rand index val-

35

483

ues than those by the other NMF-based algorithms. An empirical suggestion

484

for the regularization parameter setting of LNMFS is (β = 105 , λ = 101.5 ),

485

but better performance can be achieved if the regularization parameters are

486

optimized over the range 10(−5:0.5:5) .

487

4.6. Embedding Dimension

488

In the previous experiments, we set the embedding dimension of the NMF-

489

based algorithms to be the same as the number of clusters (Section 4.2).

490

We now study the clustering performance of LNMFS under different em-

491

bedding dimension settings, in comparison with the other nine NMF-based

492

algorithms on three typical datasets (Figure 7). For each algorithm, twenty

493

runs are conducted to obtain the average clustering performance. For each

494

embedding dimension, the regularization parameters (if any) are optimized

495

over the range 10(−5:0.5:5) .

496

In Figure 7, the ten NMF-based algorithms exhibit diverse clustering per-

497

formance curves on different datasets. LNMFS exhibits superior clustering

498

performance compared with the other algorithms for most embedding dimen-

499

sions and in terms of most evaluation criteria. An unexpected exception is on

500

the NC vs AD dataset, where the purity performance of SNMF is exception-

501

ally higher than that of the other algorithms when the embedding dimension

502

is larger than 3. The underlying reason is that SNMF partitions most of

503

the data points into a single cluster. The extraordinarily high purity and

504

F-measure of GRNMF when the embedding dimension equals 1 may be ex-

505

plained similarly. Therefore, purity and F-measure may be more susceptible

506

to unbalanced clustering results than Rand index and NMI.

Figure 7: Clustering performance curves of ten NMF-based algorithms under different embedding dimension settings on the (I) Wine, (II) COIL20 and (III) NC vs AD datasets. (A) Rand index, (B) NMI, (C) F-measure, and (D) purity curves.

On the same dataset, LNMFS exhibits similar performance curves with respect to the different evaluation criteria. As the embedding dimension increases, the clustering performance of LNMFS does not necessarily improve; rather, LNMFS often achieves the optimal or near-optimal clustering performance when the embedding dimension equals the number of clusters (e.g., when d = 3 on the Wine dataset, d = 20 on the COIL20 dataset, and d = 2 on the NC vs AD dataset). This may be because, when the embedding dimension coincides with the rank of the intrinsic data, the low-rank regularization term can be exploited most effectively. Nevertheless, this rule does not apply to the other NMF-based algorithms. For instance, GNMF has the most stable clustering performance curves under different embedding dimensions. In contrast with LNMFS, standard NMF and RNMF exhibit inferior clustering performance on the NC vs AD dataset when the embedding dimension equals the number of clusters.


5. Conclusion and Future Study

The characteristic feature of NMF is its interpretable parts-based data representation. In this study, we proposed a novel NMF-based algorithm that combines three different types of additional constraints.

First, it addresses the problem of noise and outliers by incorporating a low-rank constraint. This differs from existing robust NMF models [46, 15, 20, 24], which assume that the introduced noise is sparse and impose an l1/l21 norm constraint on the noise matrix; instead, LNMFS penalizes the nuclear norm of the intrinsic data matrix and transforms it into a sum of convex Frobenius norms of the factor matrices, a strategy initially proposed for low-rank matrix decomposition [33, 6]. Although some other NMF models also impose a low-rank constraint, they either impose it on a factor matrix [28] or aim at nonnegative matrix completion [35], and they rely on ADMM to solve the optimization problem, which is considerably slower and less concise than approximating the nuclear norm with Frobenius norms in a unified objective function.

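This reformulation rests on the standard variational identity for the nuclear norm (see [33, 6]); the symbols below are generic (Z an arbitrary matrix, U and V factors whose inner dimension is at least rank(Z)) and are not tied to the notation used elsewhere in the paper:

\[
\|Z\|_{*} \;=\; \min_{U,\,V:\; Z = U V^{\top}} \; \tfrac{1}{2}\bigl(\|U\|_F^{2} + \|V\|_F^{2}\bigr),
\]

so penalizing the (convex) squared Frobenius norms of the factors upper-bounds the nuclear norm of their product and thus encourages a low-rank intrinsic reconstruction within a single smooth objective.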
Second, LNMFS incorporates orthogonality constraints on a Stiefel manifold into the model. The advantage of enforcing orthogonality is that it reduces the redundancy of the data representation and improves the interpretability of the decomposition results. However, LNMFS does not use the retraction operator [1, 37], because ensuring the convergence of NMF while retracting onto the Stiefel manifold remains an open problem. Finally, LNMFS integrates a graph smoothness constraint to exploit the local geometry of the data manifold [7]. An alternating iterative optimization algorithm was developed for LNMFS, together with a proof of its convergence. Experiments on real-world datasets demonstrated the superiority of LNMFS over nine representative NMF-based algorithms on clustering tasks.

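For reference, the other two constraints take the standard forms below; the symbols are again generic (U ∈ R^{m×k} a basis matrix, V ∈ R^{k×n} a coefficient matrix whose i-th column v_i represents the i-th data point) and may differ from the notation used in the model section. The Stiefel-manifold assumption restricts the basis to the set of matrices with orthonormal columns,

\[
U \in \mathrm{St}(m, k) = \{\, U \in \mathbb{R}^{m \times k} : U^{\top} U = I_k \,\},
\]

while the graph smoothness term is the usual Laplacian regularizer of graph-regularized NMF [7],

\[
\tfrac{1}{2} \sum_{i,j} W_{ij}\, \|v_i - v_j\|^{2} = \operatorname{tr}(V L V^{\top}),
\]

where W is the data affinity matrix and L = D − W is its graph Laplacian with degree matrix D = \mathrm{diag}(\textstyle\sum_j W_{ij}).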
In future work, we will investigate incorporating user-provided information (in the form of "must-link" and "cannot-link" constraints) into the proposed LNMFS algorithm for semi-supervised NMF.


Acknowledgment

This work was supported in part by the Chinese National Natural Science Foundation under Grant Nos. 61402395, 61906100 and 61802336, the Natural Science Foundation of Jiangsu Province under contracts BK20151314, BK20140492 and BK20180822, the Jiangsu Overseas Research and Training Program for University Prominent Young and Middle-aged Teachers and Presidents, the Jiangsu Government Scholarship for Overseas Studies, and the Natural Science Foundation of the Education Department of Jiangsu Province under contract 18KJB520040.


References

[1] Absil, P. A., Mahony, R., Sepulchre, R., 2009. Optimization Algorithms on Matrix Manifolds. Princeton University Press.
[2] Bagga, A., Baldwin, B., 1998. Entity-based cross-document coreferencing using the vector space model. In: Meeting of the Association for Computational Linguistics and International Conference on Computational Linguistics. pp. 79–85.
[3] Berry, M. W., Browne, M., Langville, A. N., Pauca, V. P., Plemmons, R. J., 2007. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis 52 (1), 155–173.
[4] Brouwer, T., Frellsen, J., Lió, P., 2017. Comparative study of inference methods for Bayesian nonnegative matrix factorisation. In: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18-22, 2017, Proceedings, Part I. pp. 513–529.
[5] Busovaca, E., Zimmerman, M. E., Meier, I. B., Griffith, E. Y., Grieve, S. M., Korgaonkar, M. S., Williams, L. M., Brickman, A. M., 2016. Is the Alzheimer's disease cortical thickness signature a biological marker for memory? Brain Imaging & Behavior 10 (2), 1–7.
[6] Cabral, R., Torre, F. D. L., Costeira, J. P., Bernardino, A., 2013. Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. In: Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV '13). IEEE Computer Society, Washington, DC, USA, pp. 2488–2495. URL http://dx.doi.org/10.1109/ICCV.2013.309
[7] Cai, D., He, X., Han, J., Huang, T. S., 2011. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (8), 1548–1560.
[8] Choi, S., 2008. Algorithms for orthogonal nonnegative matrix factorization. In: IEEE International Joint Conference on Neural Networks. pp. 1828–1832.
[9] Comon, P., Jutten, C., 2010. Handbook of Blind Source Separation: Independent Component Analysis and Separation. Academic Press.
[10] Cunningham, J. P., Ghahramani, Z., 2015. Linear dimensionality reduction: Survey, insights, and generalizations. Journal of Machine Learning Research 16, 2859–2900. URL http://jmlr.org/papers/v16/cunningham15a.html
[11] Dahnke, R., Yotter, R. A., Gaser, C., 2013. Cortical thickness and central surface estimation. NeuroImage 65 (1), 336–348.
[12] Ding, C., Li, T., Peng, W., Park, H., 2006. Orthogonal nonnegative matrix t-factorizations for clustering. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06). pp. 126–135.
[13] Edelman, A., Arias, T. A., Smith, S. T., 1998. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications 20 (2), 303–353.
[14] Eggert, J., Körner, E., 2004. Sparse coding and NMF. In: Proceedings of the IEEE International Joint Conference on Neural Networks. Vol. 4, pp. 2529–2533.
[15] Févotte, C., Dobigeon, N., 2015. Nonlinear hyperspectral unmixing with robust nonnegative matrix factorization. IEEE Transactions on Image Processing 24 (12), 4810–4819.
[16] Févotte, C., Idier, J., 2011. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation 23 (9), 2421–2456.
[17] Hoyer, P. O., 2004. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5, 1457–1469.
[18] Jalili, M., 2016. Graph theoretical analysis of Alzheimer's disease: discrimination of AD patients from healthy subjects. Information Sciences 384, S0020025516306193.
[19] Jolliffe, I. T., Cadima, J., 2016. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374 (2065), 20150202.
[20] Kong, D., Ding, C., Huang, H., 2011. Robust nonnegative matrix factorization using l21-norm. In: ACM International Conference on Information and Knowledge Management. pp. 673–682.
[21] Lathauwer, L. D., Moor, B. D., Vandewalle, J., 2000. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications 21 (4), 1253–1278.
[22] Lee, D. D., Seung, H. S., 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), 788–791.
[23] Li, T., Ding, C., 2007. The relationships among various nonnegative matrix factorization methods for clustering. In: International Conference on Data Mining. pp. 362–371.
[24] Li, X., Cui, G., Dong, Y., 2017. Graph regularized non-negative low-rank matrix factorization for image clustering. IEEE Transactions on Cybernetics 47 (11), 3840–3853.
[25] Li, Z., Wu, X., Peng, H., 2010. Nonnegative matrix factorization on orthogonal subspace. Pattern Recognition Letters 31 (9), 905–911.
[26] Macovski, A., 1996. Noise in MRI. Magnetic Resonance in Medicine 36 (3), 494–497.
[27] Manning, C. D., Raghavan, P., Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press.
[28] Pan, J., Ng, M. K., 2018. Orthogonal nonnegative matrix factorization by sparsity and nuclear norm optimization. SIAM Journal on Matrix Analysis and Applications 39 (2), 856–875.
[29] Racine, A. M., Brickhouse, M., Wolk, D. A., Dickerson, B. C., 2018. The personalized Alzheimer's disease cortical thickness index predicts likely pathology and clinical progression in mild cognitive impairment. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring, S2352872918300150.
[30] Rijsbergen, C. J. V., 1974. Foundation of evaluation. Journal of Documentation 30 (4), 365–373.
[31] Schmidt, M. N., Larsen, J., Hsiao, F. T., 2007. Wind noise reduction using non-negative sparse coding. In: 2007 IEEE Workshop on Machine Learning for Signal Processing. pp. 431–436.
[32] Ding, S. L., Van Hoesen, G. W., Cassell, M. D., Poremba, A., 2010. Parcellation of human temporal polar cortex: A combined analysis of multiple cytoarchitectonic, chemoarchitectonic, and pathological markers. Journal of Comparative Neurology 514 (6), 595–623.
[33] Srebro, N., Shraibman, A., 2005. Rank, trace-norm and max-norm. In: Conference on Learning Theory. pp. 545–560.
[34] Strehl, A., Ghosh, J., Mooney, R., 2000. Impact of similarity measures on web-page clustering. In: Workshop on Artificial Intelligence for Web Search (AAAI 2000). AAAI, pp. 58–64.
[35] Sun, D. L., Mazumder, R., 2013. Non-negative matrix completion for bandwidth extension: A convex optimization approach. In: IEEE International Workshop on Machine Learning for Signal Processing. pp. 1–10.
[36] Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix, N., Mazoyer, B., Joliot, M., 2002. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage 15 (1), 273–289.
[37] Vandereycken, B., 2013. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization 23 (2), 1214–1236.
[38] Wang, K., Liang, M., Wang, L., Tian, L., Zhang, X., Li, K., Jiang, T., 2007. Altered functional connectivity in early Alzheimer's disease: a resting-state fMRI study. Human Brain Mapping 28 (10), 967–978.
[39] Wang, Y. X., Zhang, Y. J., 2013. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge & Data Engineering 25 (6), 1336–1353.
[40] Weller, J., Budson, A., 2018. Current understanding of Alzheimer's disease diagnosis and treatment. F1000Research 7, 1161.
[41] Wright, J., Ganesh, A., Rao, S., Peng, Y. G., Ma, Y., 2009. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In: International Conference on Neural Information Processing Systems. pp. 2080–2088.
[42] Wu, W., Kwong, S., Yu, Z., Jia, Y., Wei, G., 2018. Nonnegative matrix factorization with mixed hypergraph regularization for community detection. Information Sciences 435, 263–281.
[43] Xia, M., Wang, J., He, Y., 2013. BrainNet Viewer: A network visualization tool for human brain connectomics. PLoS ONE 8 (7), e68910.
[44] Yang, J., Yang, S., Fu, Y., Li, X., Huang, T., 2008. Non-negative graph embedding. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008). pp. 1–8.
[45] Zhang, K., Zhang, S., Liu, J., Wang, J., Zhang, J., 2019. Greedy orthogonal pivoting algorithm for non-negative matrix factorization. In: Proceedings of the 36th International Conference on Machine Learning. pp. 7493–7501.
[46] Zhang, L., Chen, Z., Zheng, M., He, X., 2011. Robust non-negative matrix factorization. Frontiers of Electrical & Electronic Engineering in China 6 (2), 192–200.
[47] Zhang, T., Fang, B., Tang, Y. Y., He, G., Wen, J., 2008. Topology preserving non-negative matrix factorization for face recognition. IEEE Transactions on Image Processing 17 (4), 574–584.
[48] Zhang, W. E., Tan, M., Sheng, Q., Yao, L., Shi, Q., 2016. Efficient orthogonal non-negative matrix factorization over Stiefel manifold. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management. pp. 1743–1752.
[49] Zhang, X., Gao, H., Li, G., Zhao, J., Li, Z., 2018. Multi-view clustering based on graph-regularized nonnegative matrix factorization for object recognition. Information Sciences 432, 463–478.
[50] Zhao, R., Tan, V. Y. F., 2016. Online nonnegative matrix factorization with outliers. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 2662–2666.


Author Contributions

Ping He contributed to developing and implementing the method, conducting experiments, visualizing and analyzing the results, and writing the manuscript. Xiaohua Xu contributed to developing and implementing the method, providing theoretical analysis of the method, and directing the work. Jie Ding contributed to the revision of the manuscript. Baichuan Fan contributed to conducting the experiments.


Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.