A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data

A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data

INS 10858 No. of Pages 16, Model 3G 12 May 2014 Information Sciences xxx (2014) xxx–xxx 1 Contents lists available at ScienceDirect Information Sc...

2MB Sizes 2 Downloads 68 Views

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Information Sciences xxx (2014) xxx–xxx 1

Contents lists available at ScienceDirect

Information Sciences journal homepage: www.elsevier.com/locate/ins 5 6

A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data

3 4 7

Q1

8

Yin Zhang ⇑, Deng Yi, Baogang Wei, Yueting Zhuang College of Computer Science, Zhejiang University, 38 ZheDa Road, Hangzhou, China

9

a r t i c l e

1 2 1 4 12 13 14 15 16

i n f o

Article history: Received 1 February 2013 Received in revised form 13 April 2014 Accepted 27 April 2014 Available online xxxx

17 18 19 20 21 22 23

Keywords: Non-negative Sparse LSA Social tagging GPU computing Tag recommendation Image classification

a b s t r a c t Nowadays large-scale social tagging data have become very valuable in organizing and indexing multimedia resources. In this paper, we apply Non-negative Sparse Latent Semantic Analysis (NN-Sparse LSA) to discover the latent semantic space behind associations between multimedia resources and tagging data. Based on the traditional coordinate-descent algorithm, column-orthogonality and non-negative constraints, we derive a much faster optimization algorithm in theory for solving the NN-Sparse LSA model. Furthermore, we implement the parallel version of our fast NN-Sparse LSA algorithm using the NVIDIA CUDA (Compute Unified Device Architecture) parallel programming framework and a data partitioning scheme that effectively reduces the memory traffic between the global memory of the Graphic Processing Unit (GPU) and the host memory. The experimental results on image classification and tag recommendation tasks on MIRFLICKR and NUS-WIDE datasets show that our parallelized fast optimization algorithm can achieve comparable or even better performance than the other examined methods, while speeds up the original optimization algorithm 20–110 times. Ó 2014 Published by Elsevier Inc.

25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41

42 43

1. Introduction

44

Along with the burgeoning of social tagging web sites, tagging has become a pervasive service for web objects and multimedia resources. On the most famous social tagging sites like Delicious1 and Flickr,2 users can assign tags to a particular bookmarked URL or photos uploaded by themselves or others. Unlike the traditional features such as text description, color, texture and shape of the objects in the images, tags are more intuitional and therefore reflect the rich semantics of these resources from a user–owner perspective. Hence, social tagging data can be used to improve tag recommendation (e.g., see [17,24,29,34,36]) and multimedia retrieval (e.g., see [6,20,35,38]). Based on large scale social tagging data, web objects or multimedia resources can be easily represented in a meaningful tag feature space. However, it is not easy to effectively and efficiently utilize these social tagging features. One reason is that tagging is not restricted to a certain rule, users can pick any tags they like to describe resources. Hence, some tags can be inconsistent and idiosyncratic, due to the users’ personal terminology as well as other different purposes [12]. Moreover, although we have to handle millions of multimedia resources and tag features, tags for a specific resource are often few in number, i.e., the notorious problem of tag data sparsity [29].

45 46 47 48 49 50 51 52 53 54 55

Q3 Q2 Q1

⇑ Corresponding author. Tel.: +86 13136157369; fax: +86 57187953779. 1 2

E-mail addresses: [email protected] (Y. Zhang), [email protected] (D. Yi), [email protected] (B. Wei), [email protected] (Y. Zhuang). http://delicious.com/. http://www.flickr.com.

http://dx.doi.org/10.1016/j.ins.2014.04.047 0020-0255/Ó 2014 Published by Elsevier Inc.

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80

2

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

In order to solve the problem of tag noise and sparsity, dimension reduction methods are often employed as a preprocessing step in the social image classification task, aiming to reduce the high-dimensional social tag space to a low-dimensional latent topic space, which can also unveil some latent patterns from image–tag dyadic datasets. Recently, many dimension reduction methods and their extensions have been successfully utilized in social tagging applications (see for examples, [17,21,26,31,36,38,40]). In this paper, we apply an unsupervised dimension reduction method named Non-negative Sparse LSA (NN-Sparse LSA) [7] to discover the latent semantic space behind associations between multimedia resources and tagging data. NN-Sparse LSA achieves significant improvement against the traditional LSA models in terms of projection computation, storage costs and the ability to generate sensible topics [7]. However, the typical coordinate-descent optimization algorithm for the NNSparse LSA model is not scalable and efficient. In order to deal with large-scale social tagging data, we propose a GPU-accelerated fast optimization algorithm for the NN-Sparse LSA model. The main contributions of this paper are summarized as follows: 1. We derive a fast optimization algorithm to analytically calculate the projection matrix in the NN-Sparse LSA model from the traditional coordinate descent algorithm, non-negative and column orthogonality constraints. Compared to the original one, our proposed algorithm reduces the computational complexity from quadratic of dimensionality of latent space to linear, while is easier to be computed in parallel. 2. We propose a parallel implementation of the fast NN-Sparse LSA algorithm using NVIDIA CUDA and GPU card. To handle the large-scale social tagging data within the limited global memory of the GPU, we adopt a well-designed data partitioning scheme that effectively handles out-of-core matrices and reduces the data traffic between the global GPU memory and the host memory during the parallel processing. 3. We conducted tag recommendation and image classification experiments on two real social tagging datasets, comparing our method with other popular techniques, such as Latent Semantic Analysis (LSA) [10], Sparse Coding [18] and Latent Dirichlet Allocation [2]. The experimental results show that our proposed approach achieves comparable or even better performance than other examined methods, while it offers a 20–110 speed-up compared to the original NN-Sparse LSA algorithm.

81

86

The rest of this paper is organized as follows: Section 2 briefly reviews some related works. Section 3 presents the NNSparse LSA model and its applications in tag recommendation and image classification tasks. Section 4 presents how to derive our fast NN-Sparse LSA algorithm. Section 5 introduces the GPU-based implementation of our fast algorithm using the CUDA programming framework. Section 6 shows our experimental evaluations and results. Finally, we conclude our work in Section 7.

87

2. Related work

88 90

Recently, many dimension reduction models and their extensions have been proposed and applied in social tagging applications. In this section, we will review some of these models and applications. Moreover, we will also report some parallel implementations of them.

91

2.1. Social tagging applications using dimension reduction models

92

An effective approach for assistive tagging [34] is tag recommendation. Its related studies can be roughly classified into two categories: personalized tag recommendation [24,31,32] and collective tag recommendation [17,36,38]. For personalized tag recommendation, a recent book [24] on tag recommender systems summarizes a wide variety of approaches, one of which is the tensor factorization approach [31,32], whose idea is to cast the ternary relationship (i.e., user, resource, tag) as a third-order tensor completion problem and employ the high-order Singular Value Decomposition (SVD) techniques for tensor reduction and reconstruction. For collective tag recommendation, one state-of-the-art approach is tag completion [17,36,38]. This approach represents the relations between tags and resources as a matrix, and its goal is to automatically predict the missing tags in the sparse tag–resource matrix. Wu et al. [38] propose a tag completion algorithm by ensuring the optimal completed tag matrix to be consistent with both the observed tags and the visual similarity. Wang et al. [36] proposes a novel efficient Hashing approach for Tag Completion and Prediction (HashTCP). In [17], Krestel et al. employ LDA [2] to elicit latent topic–resource relationships and tag–topic relationships to recommend tags for new resources. In this paper, we investigate the tag completion problem and propose a fast NN-Sparse LSA method for predicting the missing tags for new resources. Numerous social multimedia classification and search methods can be roughly classified into three categories: contentbased [19,25,33], tag-based [21] and hybrid methods [6,35]. Compared to content-based methods, tag-based methods are usually not only more effective but also more efficient. Chen et al. [6] propose a new tag-based image retrieval framework to improve the retrieval performance of a group of related personal images. In [35], the semantic similarities of social images are estimated based on their tags. Lin et al. [21] introduce a Sparse Tag Patterns (STP) model to discover the compact yet discriminative representation for both tag pattern-tag and text-tag pattern relationships from large scale user contributed

82 83 84 85

89

93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

3

114

tags among social data. Nevertheless, there are still many alternative dimension reduction methods to be explored, such as LSA [7], LSI [13], Non-negative Matrix Factorization (NMF) [3,23], Sparse Coding [18]. In this paper, we therefore propose a GPU-based NN-Sparse LSA technique for tag-based image classification by enforcing the column-orthogonality and non-negative constraints on the traditional LSA model.

115

2.2. Parallel implementations

116

129

So far, few works focus on how to improve the efficiency of dimension reduction models with parallel computing techniques. A parallel C++ implementation of the LDA technique that utilizes a distributed computing environment with 2048 CPUs, has been presented in [22]. Smola and Narayanamurthy [30] describe a high performance sampling architecture for inference of latent topic models on a cluster of workstations. The learning process of Regularized Latent Semantic Indexing (RLSI) has been parallelized via MapReduce in [37]. Unlike these works, in this paper we present how to utilize the NVIDIA CUDA-enabled3 GPU cards in order to achieve a significant boost in time performance of the NN-Sparse LSA model. Recently, parallel processing techniques based on GPU and CUDA4 have been widely studied and applied in various scientific fields [27]. CUDA-capable GPUs have a parallel throughput architecture that allows the concurrent execution of many different threads, for general purpose processing. In [4], Cavanagh et al. present a parallel LSA implementation on GPU, which utilizes a Lanczos algorithm to conduct the SVD method. A GPU-accelerated PCA based on the Gram-Schmidt orthogonalization algorithm has been proposed in [1]. Yan et al. [39] study the problem of parallelizing the inference method for the LDA model on GPU, while a GPU-based NMF algorithm has been proposed in [23]. Unlike previous studies, we implement a GPU-based version of our fast optimization algorithm for the NN-Sparse LSA model using the CUDA framework. To the best of our knowledge, implementing this model on GPU has not been studied previously.

130

3. The Non-negative Sparse LSA model and its applications to social tagging data

131

137

Formally, we can represent social tagging data with a resource–tag relationship matrix X 2 RMN , where M is the number of resources (such as images, web pages, and videos) and N is the number of tags. Each entry X ij of this matrix is a score (that is calculated for example by counting the number of links) which quantifies the association between a resource ri and a tag t j . In this paper, we apply the NN-Sparse LSA technique on those social tagging data, in order to automatically discover the most appropriate low-dimensional latent space embedded in the resource–tag relationship matrix X. Projecting a resource ri from the original tag space to a low-dimensional latent space often results in better performance and efficiency. In the following, we will describe the NN-Sparse LSA model and its applications to tag recommendation and image classification tasks.

138

3.1. The Non-negative Sparse LSA model

139

Given the dimensionality of the latent space DðD 6 minfN; MgÞ, the objective of NN-Sparse LSA model is to find a sparse and non-negative projection matrix A (where X  UA, and we enforce the column orthogonality constraint on U, i.e. UT U ¼ I) that performs the transformation and mapping of the resources from the tag space to the latent topic space. Due to the enforced sparsity constraint on the projection matrix A, this model can obtain more interpretable tag–topic relationships. Moreover, the projection matrix A is non-negative, so each tag has a pseudo probability of belonging to a specific topic, similar to LDA [2]. More formally, the formulation of the NN-Sparse LSA model is as follows:

111 112 113

117 118 119 120 121 122 123 124 125 126 127 128

132 133 134 135 136

140 141 142 143 144 145

146

min U;A

150 151 152 153

154

158 159 160

;

ð1Þ

where U 2 RMD is the low-rank matrix of latent variables, A 2 RDN is the so-called projection matrix, D is the dimensionP P ality of the learned latent space, k  kF is the Frobenius norm, kAk1 ¼ Dd¼1 Nj¼1 jAdj j is the entry-wise l1 -norm of A; A P 0 is the non-negative constraint, k is the positive regularization parameter which controls the density (the number of non-zero entries) of A. In order to obtain the most relevant tags to a specific topic, we just need to normalize each row of A to 1:

e dj ¼ P Adj : A N j¼1 Adj

156 157

 UAk2F þ kkAk1

subject to : UT U ¼ I; A P 0

148 149

1 kX 2

ð2Þ

e dj can be viewed as a Since Adj measures the relevance of the j-th tag to the d-th topic, from a probability perspective, A pseudo probability Pðtj jwd Þ of the tag tj given the topic wd . ^ via the operation Moreover, the NN-Sparse LSA could project a novel data q 2 RN into a sparse vector representation q ^ ^. q ¼ Aq, then the most relevant topics can be identified based on the non-zero entries in vector q 3 4

https://developer.nvidia.com/category/zone/cuda-zone. https://developer.nvidia.com/cuda-gpus.

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1 161

4

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

Algorithm 1. Traditional Optimization Algorithm for the NN-Sparse LSA Input: X, the dimensionality of latent space D, regularization parameter k   ID Initialization: U0 ¼ 0 Iterate until convergence of U and A: 1. Apply the coordinate descent method to solve A 2. Project X into the latent space:V ¼ XAT 3. Compute the SVD of V: V ¼ PDQ , let U ¼ PQ Output: Sparse projection matrix A

170 171 172 173 174 175

177 178 179 180 181

The traditional optimization algorithm for solving Eq. (1) is shown in Algorithm 1, which includes two steps in each iteration: solve A when matrix U is fixed, and vice vesa. Firstly, when the matrix U is fixed, Eq. (1) with respect to A can be decomposed into N independent sub-problems, each one corresponding to a column of A:

min f ðAj Þ ¼ Aj P0

ð3Þ

where Aj denotes the j-th column of projection matrix A; Adj denotes the d-th element in Aj . We can apply the coordinate descent method to minimize the function f ðAj Þ. Secondly, when the projection matrix A is fixed, Eq. (1) is equivalent to:

min U

183

D X 1 kX j  UAj k22 þ k Adj ; 2 d¼1

1 kX 2

 UAk2F

subject to : UT U ¼ I

ð4Þ

:

190

To solve Eq. (4), we can define a matrix V and let V ¼ XAT . Assuming that the SVD of V is calculated by V ¼ PDQ , then the optimal solution is U ¼ PQ (see proof in [7]). The whole algorithm proceeds in an alternating manner until the convergence of both matrices U and A. The stopping criterion is often defined as follows: calculate the maximum change in matrix U and A respectively for two consecutive iterations t and t þ 1, which can be computed as kUðtþ1Þ  Ut k1 and kAðtþ1Þ  At k1 , where k  k1 denotes the matrix entry-wise l1 -norm. If the maximum change in both matrices U and A is smaller than a predefined constant c, the loop will end up with the final projection matrix A.

191

3.2. Applying the Non-negative Sparse LSA to tag recommendation

192

In this section, we introduce how to apply the NN-Sparse LSA model to the tag recommendation task. Tag recommendation can relieve users from the eventually time consuming task of coming up with a good set of tags, since recognizing which tags to use for annotating a given resource requires far less cognitive effort than conceiving [24]. In order to effectively recommend high-quality tags to a given resource, in the training stage, we apply the NN-Sparse LSA model on the resource–tag relation matrix X (training set) to obtain a projection matrix A. From a topic modeling perspece dj after the row tive, the projection matrix A can be seen as the topic–tag relationship matrix. As described in Section 3.1, A normalization of the projection matrix A denotes a pseudo probability Pðtj jwd Þ of tag t j given topic wd . Based on these pseudo probabilities, we can find the most relevant tags belonging to a specific latent topic. In the inference stage, we project those new resources with only a few tags (testing set) from the tag feature space to the e id after row normalization of V denotes a pseudo probability Pðwd jri Þ low-dimensional latent topic space by V ¼ XAT . Then, V of topic wd given new resource ri . Hence, the probability Pðtj jr i Þ of tag t j given resource ri can be formulated as:

184 185 186 187 188 189

193 194 195 196 197 198 199 200 201 202 203

205

Pðt j jr i Þ ¼

D D X X e dj V e id ; Pðt j jwd ÞPðwd jri Þ ¼ A d¼1

ð5Þ

d¼1

207

where the number of latent topics D needs to be defined in advance. The final recommendations to resource ri are those tags with the top-N largest probability values of Pðt j jr i Þ.

208

3.3. Applying the Non-negative Sparse LSA to image classification

209

In this section, we introduce how to apply the NN-Sparse LSA model to classify images annotated by various users in the social tagging web site. We first transform the original social tagging data into an image–tag relationship matrix X, where X ij is calculated in a TFIDF way to reflect how important the tag t j is to the image ri . Hence, each image r i can be represented with a tag feature vector X i . We then apply our NN-Sparse LSA algorithm on the image–tag relationship matrix X and end up with a non-negative sparse projection matrix A.

206

210 211 212 213 214

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

5

215

In the classification stage, we calculate the low-dimensional compact representation V i of each image ri by applying pro-

216 218

jection V i ¼ X i AT to its high-dimensional tag representation X i . Afterwards, we train the linear SVM classifier [5] using as ground-truth the low-dimensional representations V i ; i ¼ 1; . . . ; N train in the training set and we employ it for the classification of the low-dimensional representations of the images in the testing set.

219

4. Fast optimization algorithm for NN-Sparse LSA

220 222

In this section, we present how to derive a much faster optimization algorithm for solving A in theory, based on the traditional coordinate descent method and the non-negative and column orthogonality constraints of the NN-Sparse LSA model. From a computational complexity perspective, the cost of solving A in the first step is OðtMND2 Þ, where t is the total num-

223

ber of iterations of the coordinate descent method, and that of calculating U in the second step is OðN 3 Þ. If the total number of

224

iterations of the whole algorithm is T, the whole computational cost is OðtMND2 T þ N 3 TÞ. Hence, the traditional optimization algorithm for NN-Sparse LSA model does not run fast, especially in social tagging related applications where the number of resources or associated tags is often over 100 K. The problem of computing A can be decomposed into N independent sub-problems as shown in Eq. (3). The traditional coordinate descent optimization method for solving each sub-problem is as follows: first initialize A0j 2 RD with a vector of random values, then in each iteration, pick up each dimension d in Aj in turn, and optimize f ðAj Þ with respect to its d-th dimension while keeping other entries in Aj fixed. Formally, if Atj is obtained in the t step, then the d-th dimension of Atþ1 in ðt þ 1Þ step is given by: j

217

221

225 226 227 228 229 230 231 232

234 235 236

238 239 240

242 243 244 245 246 247

  tþ1 tþ1 t t Atþ1 dj ¼ argmin f A1j ; . . . ; A½d1j ; y; A½dþ1j ; . . . ; ADj y

ð6Þ

To solve the problem in (6), we need to calculate the gradient of f ðAj Þ with respect to a specific Adj :

@f ðAj Þ ¼ cd Adj  Bdj þ k; @Adj where cd ¼

Bdj ¼

PM

2 i¼1 U id ; U id

M X

is the i-th row and d-th column of the latent variables matrix U.

U id X ij 

i¼1

ð7Þ

D X

!

U ik Akj ;

ð8Þ

k–d

where X ij denotes the i-th element in the j-th column of matrix X. B k @f ðA Þ @f ðA Þ It is easy to verify that when Bdj > k, setting Adj ¼ djc will make @A j ¼ 0. On the other hand, if Bdj 6 k; @A j P 0 for all d dj dj Adj P 0 (recall the non-negative constraint of the NN-Sparse LSA model), so the minimization of f ðAj Þ is achieved when  Adj ¼ 0. We summarize the optimal solution Adj as follows:

(B

Adj

249

¼

dj k

cd

0

Bdj > k Bdj 6 k

ð9Þ

251

The iterative coordinate descent algorithm will end up to the final solution Aj when the maximum change in Aj for two ðtþ1Þ consecutive iterations t and t þ 1, formally kAj  Atj k1 , is smaller than a prefixed constant.

252

Algorithm 2. Fast Optimization Algorithm to Calculate Matrix A

250

Input: Matrix X and U, the value of M, N; D and k 1: for d ¼ 1; 2; . . . ; D do P 2 2: Compute cd ¼ M i¼1 U id 3: end for 4: Initialize A ¼ B ¼ 0 5: for d ¼ 1; 2; . . . ; D do 6: for j ¼ 1; 2; . . . ; N do P 7: Compute Bdj ¼ M i¼1 U id X ij 8: if ðBdj > kÞ then 9:

Update Adj ¼

Bdj k cd

10: else 11: Update Adj ¼ 0 12: end if 13: end for 14: end for Output: Sparse projection matrix A 270

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1 271 272 273 274 275 276 277 278 279 280

6

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

Obviously, the main cost of solving Adj comes from the calculation of Bdj , whose computational complexity is OðMDÞ. Moreover, if the number of iterations is t and the number of tags is N, then the computational complexity of calculating A is OðtMND2 Þ, which is extremely expensive, especially in the case of large scale social tagging data. Fortunately, we will prove that the calculation of Bdj is not as complex as it looks and does not depend on Akj ; k – d, under the column orthogonality and non-negative constraints of the NN-Sparse LSA model, which leads to a much faster method for calculating Bdj (see line 7 in Algorithm 2) than the original algorithm. Algorithm 2 presents our fast optimization algorithm for calculating matrix A. The details of proof are as follows: Suppose that we want to calculate the d-th coordinate of Atþ1 in (t + 1)-th iteration j step, we first need to calculate the corresponding Bdj following the aforementioned steps. We can rewrite Eq. (8) in the following form:

281

283 284

Bdj ¼

M M d1 D X X X X U id X ij  U id U ik Atþ1 U ik Atkj kj þ i¼1

i¼1

k¼1

288

We denote the second part in Eq. (10) as F dj and rewrite it as follows:

F dj ¼

292 293

M X

U id

i¼1

d1 X k¼1

tþ1 U ik Akj þ

!!

D X

U ik Atkj

¼

k¼dþ1

! d1 X M D X X tþ1 U id U ik Akj þ k¼1

i¼1

k¼dþ1

! M X U id U ik Atkj

ð11Þ

i¼1

If we expand the formulation of Eq. (11), F dj could be rewritten as:

289

291

ð10Þ

:

k¼dþ1

285

287

!!

F dj ¼

! ! ! ! M M M M X X X X tþ1 t t A A U id U i1 Atþ1 þ    þ U U þ U U þ    þ U U id iðd1Þ id iðdþ1Þ id iD ADj : 1j ½d1j ½dþ1j i¼1

i¼1

i¼1

ð12Þ

i¼1

Recall that matrix U is a column orthogonal matrix ðUT U ¼ IÞ in the NN-Sparse LSA model, so the item PM i¼1 U id U ik ðk ¼ 1; . . . ; D; k – dÞ in Eq. (12) could be expressed as:

294

296 297 298 299 300

301

303

M X U id U ik ¼ U Td U k ¼ 0 ðk ¼ 1; . . . ; D; k – dÞ:

ð13Þ

i¼1

Plugging Eq. (13) into Eq. (12), we could get F dj ¼ 0. However, due to the computational error in the practical calculation, the value in Eq. (13) will infinitely approach 0, so the value of F dj is infinitely small. For this reason, we drop off the calculation of F dj which is trivial for the final value of Bdj and clearly saves a huge calculation cost. According to the above inference, Bdj could be rewritten as:

Bdj ¼

M X U id X ij :

ð14Þ

i¼1

304

It is straightforward that the computational cost of calculating Bdj is reduced to OðMÞ, which is the linear complexity.

305

tþ1 t Moreover, the calculation of Atþ1 dj is irrelevant to the values of Apj ðp ¼ 1; . . . ; d  1Þ and values of Aqj ðq ¼ d þ 1; . . . ; DÞ. Hence,

310

our approach directly calculates the optimal solution of A in Eq. (3) by multiplying UT and X, and avoids the cost of t iterations in the traditional iterative approach to get the optimal solution of Adj in Eq. (6). Based on our optimization algorithm for NN-Sparse LSA, the computational cost of calculating A is reduced to OðMNDÞ. Furthermore, the optimization over Adj (d = 1, . . . , D) in our approach can also be decomposed into D independent sub-problems, thus making the parallel computation of our fast optimization algorithm much easier.

311

5. GPU implementation of fast optimization algorithm

312

In this section, we introduce how to parallelize our fast optimization algorithm for the NN-Sparse LSA model on GPU using the CUDA framework. Although each calculation of Bdj as shown in Algorithm 2 seems very simple, a large number of resources and tags still give rise to huge amounts of computation and storage costs. The main issue we encounter is the limited size of the global memory on GPU versus the huge matrix composed of resources and tags, we therefore have to employ a data partitioning scheme in order to handle out-of-core matrices on GPU. We briefly review the CUDA programming model before introducing our parallel implementation. A CUDA program consists of one or more kernels which are suitable for parallel execution on GPU. Each kernel is executed by a grid of thread blocks and each thread block contains a fixed number of threads. The threads per thread block can be synchronized with each other and have a high-speed shared memory for inter-thread communication. To execute the program on GPU, we should first copy the data from the host memory to the global GPU memory.

306 307 308 309

313 314 315 316 317 318 319 320 321

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1 322

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

7

Algorithm 3. Parallel Algorithm to Calculate Matrix A Input: Matrix X and U and the value of M, N; D; k. (Suppose that the matrix size has been adjusted.) /⁄ Compute each cd in parallel ⁄/ for d ¼ 1; 2; . . . ; D do Copy d-th column of U to GPU global memory Utilize a parallel sum reduction to compute cd Copy the result of cd back to the host memory end for Initialize: Matrix A ¼ zerosðD; NÞ, BlockSize = 16 /⁄ Divide matrix into groups according to the size of GPU global memory ⁄/ U ¼ ½U 1 ; U 2 ; ::; U i ; ::; U m ; X ¼ ½X 1 ; X 2 ; ::; X j ; ::; X n , A ¼ ½A11 ; A12 ; ::; Aij ; ::; Amn  /⁄ For each group in matrix A ⁄/ for i ¼ 1; 2; . . . ; m do for j ¼ 1; 2; . . . ; n do Copy U i ; X j ; Aij to GPU global memory Compute Aij on GPU Copy Aij back to the host memory end for end for Output: Sparse projection matrix A

344

345

5.1. Calculating matrix A in parallel

346

371

For calculating A in parallel, Algorithm 3 presents our GPU-based parallel version of Algorithm 2. The first step is to comP 2 pute cd ¼ M i¼1 U id D times in parallel (see line 2 in Algorithm 2). We slightly modify a parallel sum reduction algorithm available in the NVIDIA GPU Computing SDK 5, to speed up the summation of a large array of values. Intuitively, the remaining steps (see lines 5–14 in Algorithm 2) are basically a matrix multiplication operation. So it is reasonable to parallelize them by distributing the processing of each element in matrix A across many threads. However, the storage space of matrix U and X in the host memory is usually much larger than that of the global memory on the GPU, which makes it impossible to obtain matrix A just by one multiplication operation. For this reason, we propose a data partitioning scheme to overcome the limited memory constraint of the GPU. Fig. 1 shows the hierarchical structure of our parallel algorithm. First, we divide matrix A into dozens of groups. Each group in matrix A is calculated by the corresponding group in matrix U and that in matrix X. As shown in Fig. 1(a) group A13 is the multiplication of U Group1 and X Group3 . One GPU grid will process one group at a time. Second, we further divide each GPU grid into several two-dimensional blocks, each of which contains multiple threads and calculates one tile (see Asub in Fig. 1(b)) in a specific group of matrix A. Asub is calculated by the product of the corresponding U-tile and X-tile. To utilize the high-speed but small-size per-block shared memory for reducing global memory traffic, we further divide its U-tile and X-tile into smaller sub-tiles (see u; v ; w and u0 ; v 0 ; w0 in Fig. 1(b)). The size of the sub-tiles is chosen so that they can fit into the shared memory. Finally, product calculations include several phases based on fine-grained partitioning (see Phase 1, Phase 2 and Phase 3 in Fig. 1(c)). In each phase, all threads in one block cooperate to load a pair of corresponding sub-tiles (e.g., u and u0 ) into the shared memory; then, the inner product of the considered pair of sub-tiles is calculated and added up in the corresponding position of Asub (i.e., Asub ¼ uu0 þ vv 0 þ ww0 ). After these phases end, all threads in a block further cooperate to update the appropriate elements in Asub according to the value of k (see lines 8–12 in Algorithm 2), and write the final values back to the global memory. In our experiments, the size of each GPU thread block is 16  16. In order to divide each group appropriately, the size of all three matrices ðU; A; XÞ should be an integer multiple of the block size. We can simply expand the matrix with 0 to make the size match this condition. Moreover, the data type of GPU calculation is single-precision, so we further utilize the Kahan’s Summation Formula [16] to improve the calculation accuracy.

372

5.2. Calculating matrix U in parallel

373

As shown in Algorithm 1, most of the computational cost for calculating matrix U is spent on the following two parts: matrix multiplication ðV ¼ XAT ; U ¼ PQ Þ and singular value decomposition ðV ¼ PDQ Þ. The main idea to parallelize matrix

347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370

374

5

https://developer.nvidia.com/gpu-computing-sdk.

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1

8

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

Fig. 1. The hierarchical structure of our parallel algorithm. Matrix A is first divided into several groups and one grid on the GPU processes one group at a time as shown in subgraph (a). Then, each group is further divided into several tiles and blocks in one grid, compute these tiles in parallel as shown in subgraph (b). The process of calculating one tile by a fixed number of threads in one block is shown in subgraph (c).

375 376 377 378 379 380 381 382

multiplication is basically the same with the approach described in Algorithm 3 for calculating matrix A. Thus, the calculation of matrix multiplication is no longer a huge obstacle. There have been many tools to carry out the SVD on the GPU. Jacket6, the premier GPU software plugin for MATLAB, provides the SVD function. However, it is not for free use. So we prefer CULA [15] that provides the SVD function on the GPU in its free academic edition. Moreover, it is easy to launch MATLAB with CULA acceleration enabled.7 Note that all the above tools do not consider the data partitioning scheme that manipulates out-of-core matrices on the GPU. However, in our case, we only need to calculate the SVD of the matrix V 2 RMD . Since D is far smaller than N, the existing tools can meet our needs. We finally utilize CULA for performing the SVD on the GPU.

6 7

http://www.accelereyes.com/. http://www.culatools.com/blog/2011/05/31/accelerate-matlab-with-the-cula-link-interface/.

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1

9

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx Table 1 The statistics of the tag recommendation data sets. MIRFlickr

Training set Testing sets

NUS-WIDE

#Images

#Tags

Density

#Images

#Tags

Density

3951 1982

1386 1386

0.66% —

22,933 11,465

5018 5018

0.45% —

383

6. Experimental evaluations

384

388

We evaluated the performance of the proposed GPU-accelerated NN-Sparse LSA model, both in terms of time efficiency and accuracy, considering the social-tagging applications described in Section 3. We utilized a workstation equipped with two Intel Xeon E5620 (2.40 GHz) CPUs and 16 GB RAM. In order to accelerate our fast optimization algorithm on GPU, we utilized the NVIDIA Tesla C20758 GPU card which contains 448 NVIDIA CUDA cores and 6 GB memory. We implemented the CUDA codes using C and Matlab.9

389

6.1. Tag recommendation

390

In this section, we introduce our experiments on tag recommendation using GPU-accelerated NN-Sparse LSA model. Our model aims at recommending tags to new resources.

385 386 387

391

392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407

408 409 410 411 412

6.1.1. Data sets We conducted tag recommendation experiments on the MIRFlickr-25 K [14] and NUS-WIDE-270 K [8] datasets. MIRFlickr-25 K and NUS-WIDE-270 K datasets were both collected from Flickr website. The MIRFlickr-25 K dataset contains 25,000 images with 1386 unique tags. The NUS-WIDE-270 K which is the largest publicly available human-annotated dataset, includes a total of 269,648 images with 5018 unique tags. There are 38 and 81 categories in the MIRFlickr and NUS-WIDE image datasets, respectively. The associations between images and tags are very sparse in both datasets. So we first chose images associated with at least 10 and 15 unique tags from the MIRFlickr and NUS-WIDE datasets respectively, and then split those images into a training set and a number of testing sets with the ratio of 2:1. The testing sets differ in the number of tags associated with each image, i.e. 1–5 tags, to simulate new images in the cold start situation. We further left out one existing tag for each image in the testing set, to act as a validation set for performance evaluation. We finally obtained 1 training set, 5 testing sets and 5 corresponding validation sets for each social-tagging dataset. Table 1 shows the statistics of our final datasets. Table 2 reports the densities of the 5 used testing sets. Note that these densities are very low. The training set is used as input to learn a model, then a Top-N recommendation list for each image in the testing set is generated by the model. The evaluation is conducted by comparing the Top-N recommendation list and the corresponding tag for each image in the validation set. 6.1.2. Evaluation metrics We utilized Hit Rate (HR) and the Average Reciprocal Hit-Rank (ARHR) to evaluate the recommendation quality. It has been generally accepted that HR and ARHR are two direct and meaningful measures for the evaluation of the Top-N recommendation. HR is defined as:

413

HR ¼

415 416 417 418 419

#hits ; #images

where #images is the total number of images in the testing set, and #hits is the number of images whose tag in the validation set also exists in the Top-N recommendation list. HR measures the ability of the method to generate a relevant Top-N list for each image. Another metric is ARHR, which is defined as:

420

X1 1 ; #images i¼1 pi #hits

ARHR ¼

422 423 424

where p is the position of the hitting tag in the Top-N recommendation list. ARHR is a weighted version of HR and it measures the ability of the method to generate an appealing Top-N list for each image. 8 9

http://www.nvidia.com/object/personal-supercomputing.html. http://www.mathworks.cn/cn/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html.

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

10

Table 2 Densities of testing sets with 1–5 tags associated with each image.

425 426 427 428 429 430 431 432 433 434 435 436

437 438 439 440 441 442 443 444 445

Density (%)

#1

#2

#3

#4

#5

MIRFlickr testing set NUS-WIDE testing set

0.072 0.019

0.144 0.039

0.216 0.059

0.289 0.079

0.361 0.099

6.1.3. Comparison algorithms and details We compare the performance of our GPU-accelerated NN-Sparse LSA model with 3 other popular Top-N recommendation algorithms. The first one is the item-based K-Nearest-Neighbor (itemKNN) method [11], where the similarity between pairs of items is calculated by the cosine similarity measure. The second one is the PureSVD method [9] which is proven that outperforms other matrix factorization methods in Top-N recommendation. PureSVD also has the convenience of representing the users as a combination of item features and offering designers a flexibility in handling new users, and new ratings by existing users. The last one is an approach based on LDA for recommending tags of resources [17]. As described in Section 3.2, LDA is used to elicit latent topics from resources equipped with a fairly stable and complete tag set. Then, new resources with only a few tags are mapped to these topics. Based on this, other tags belonging to a topic can be recommended for the new resource. Regarding implementation details, we implemented the itemKNN and PureSVD methods in C, and utilized the MATLAB code accompanying the paper of Shan and Banerjee [28] as the implementation of LDA. 6.1.4. Experimental results In this section, we present the experimental results of the tested methods in the tag recommendation task. In our experiments, we think Top-10 recommendations is a reasonable setting. Actually users cannot bear a long list of recommended tags. We tuned the parameter settings of each method to achieve the best performance. The optimal parameter settings of each method are as follows:    

NN-Sparse LSA: the dimensionality of the latent space D ¼ 100, and the regularization parameter k ¼ 0:03. PureSVD: the number of singular values is 80. LDA: the dimensionality of the latent space is 50, and the Laplacian smoothing parameter is 105 . itemKNN: the number of neighbors selected for each item is 5.

446

462

We report the Top-10 recommendation performance (in terms of HR and ARHR) of different tested methods using different testing sets (#1, #2, . . ., #5) in Fig. 2. These testing sets have 1–5 known tags for each image, respectively. The results in Fig. 2 show that our NN-Sparse LSA method can achieve better performance than the other methods on all testing sets. Due to the non-negative sparse projection matrix, our method can obtain more interpretable image–topic–tag relationship, thus performing better on all the defined testing sets. The LDA-based method also performs well on the testing sets #1 and #2. This indicates that topic modeling can better deal with the cold start problem. Even though known tags of new resources are few in number, topic modeling methods have the ability to recommend appropriate tags belonging to the most relevant latent topics. PureSVD and itemKNN methods do not address effectively the cold start problem. Specifically PureSVD method cannot properly represent each image with only 1 or 2 known tags as a combination of item features, while the itemKNN method can only recommend tags which are similar to known tags. When the known tags are few in number, the itemKNN method will ignore many proper candidate tags. We further present the Top-10 recommendation performance (in terms of HR and ARHR) of the examined methods on the testing set #5 in Table 3. The results indicate that NN-Sparse LSA shows the best performance on both datasets. For MIRFlickr dataset, the HR result of NN-Sparse LSA is 5.98%, 51.7%, 73.9% relatively better than that of PureSVD, LDA, itemKNN respectively. The ARHR result of NN-Sparse LSA is 80.6%, 84.95%, 209.7% relatively better than that of the other methods, respectively.

463

6.2. Image classification

464

Since the NN-Sparse LSA model is a dimension reduction method, we tested its performance for image classification after projecting the image representation from the tag space to the low-dimensional space.

447 448 449 450 451 452 453 454 455 456 457 458 459 460 461

465

466 467 468 469 470 471 472

6.2.1. Data sets We also carried out our experiments on the MIRFlickr-25 K [14] and NUS-WIDE-270 K [8] datasets. Note that we used all data in these two datasets to test the performance and the efficiency of all the considered methods in the image classification task. Table 4 shows the statistics of our experimental datasets. After transforming the original data into an image–tag relationship matrix, we calculate the TF/IDF-like weights of tags associated with each image to indicate the most meaningful tags to it. For each category of images, we randomly split images into a training and a testing set with the ratio 1:1. In the image classification scenario, we consider the images that belongs to Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1

11

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

Fig. 2. Top-10 Recommendation results on different testing sets.

Q4

Table 3 Top-10 recommendation results on testing set #5. MIRFlickr

NN-Sparse LSA PureSVD LDA ItemKNN

NUS-WIDE

HR

ARHR

HR

ARHR

0.2518 0.2376 0.1660 0.1448

0.1217 0.0674 0.0658 0.0393

0.1537 0.1197 0.1185 0.0681

0.0572 0.0255 0.0434 0.0240

Table 4 The statistics of the image classification datasets.

473 474 475 476 477 478

479 481

482 483 484

Dataset

#Images

#Tags

#Categories

MIRFlickr-25 K NUS-WIDE-270 K

25,000 269,648

1386 5018

38 81

their own category as positive samples, and then we randomly choose the same number of images from other categories as the negative samples in the training set. 6.2.2. Evaluation metrics We use Accuracy (ACC) to measure the classification performance. For each category, let M denote the total number of images in the testing set, and M r is the number of images that are assigned to the right categories according to the ground truth, then the classification accuracy is defined as:

ACC ¼

Mr : M

6.2.3. Comparison algorithms and details We compare our GPU-accelerated NN-Sparse LSA with the other four popular dimension reduction methods: (1) Original Non-negative Sparse LSA (NN-Sparse LSA) [7]. (2) Sparse LSA [7]. (3) Traditional LSA. (4) Sparse Coding [18]. For the original Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1

486 487 488 489 490 491 492

493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

NN-Sparse LSA, we apply the original coordinate descent method to calculate matrix A. We follow the optimization algorithm proposed in [7] to solve the Sparse LSA model. For the Sparse Coding model, we directly utilize the code from [18]. The application of our GPU-accelerated NN-Sparse LSA model for the image classification task has already been described in Section 3.3. We followed the same procedure for performing image classification using the other four comparison methods. All these methods are used to project the image–tagging data into a low-dimensional space. Then, image classification can be performed in the low-dimensional space with a linear SVM classifier from LIBSVM [5], whose regularization parameter c is selected by cross-validation. In order to evaluate the experimental results, we also report results calculated by the linear SVM on the original data as the baseline.

6.2.4. Results on MIRFlickr-25 K In this section, we report the experimental results on the MIRFlickr-25 K data set. For the Sparse LSA and NN-Sparse LSA model, we set the regularization parameter k ¼ 0:05, while for the Sparse Coding model, we set the regularization parameter b ¼ 0:1. Fig. 3 shows the average classification accuracy results of the examined methods for varying dimensionalities. Moreover, the classification accuracy for each category and for the dimensionality equal to 500, is shown in Table 5. Note that the performance of our GPU-accelerated NN-Sparse LSA model is almost the same as that of the original NN-Sparse LSA model, for all dimensionality settings, which indicates that our fast optimization algorithm is correct, since it obtains the same results as the original one. When the dimensionality is small, our fast approach is slightly worse than the traditional LSA and Sparse Coding model. This is possibly because the number of elements in the sparse projection matrix A is small, incurring losses of some important latent relationships. However, for a larger dimensionality, our fast approach shows its advantage, achieving a similar classification accuracy using a much sparser projection matrix (See Table 6). A sparse projection matrix A will save a lot of computational and storage cost. Note that the Sparse LSA performs worse than the linear SVM for large dimensionality, the reason for this is that the negative entries in the projection matrix A are meaningless. As the dimensionality grows up, the proportion of meaningful entries in the sparse projection matrix decreases significantly, which incurs the ineffective lowdimensional representation of an image. Table 7 shows the average projection time of four different dimension reduction methods. The projection time is defined as the CPU time of projecting an image from the original tag space to a low-dimensional space. The average projection time of our approach is many times smaller than that of the LSA and Sparse Coding model, while the classification accuracy of our GPU-accelerated approach is comparable (slightly worse) than that of the LSA and Sparse Coding model. The Sparse Coding model takes very long processing time (compared to other examined methods) to calculate M L1 -norm regularized problems in total, within the projection operation of the Sparse Coding [7]. Hence, our approach is much more scalable than other examined methods when dealing with large-scale social tagging data. Moreover, we compared the efficiency of our GPU-accelerated optimization algorithm against CPU-based implementations. The GPU-based implementation was tested on NVIDIA Tesla C2075 GPU card. The CPU-based implementations were tested on the Intel Xeon E5620 (2.40 GHz) CPU and 16 GB RAM. Only one core of the CPU was used. We rewrote the original optimization algorithm for NN-Sparse LSA model in the C/C++ language, which runs much faster than the original code in MATLAB. We also implemented our fast optimization method in C/C++ in order to evaluate the efficiency of our approach on CPU. Table 8 reports the average execution time per iteration for different implementations of NN-Sparse LSA model. Our GPUaccelerated approach shows a 28 speedup on average compared to the CPU-based implementation of the original optimization algorithm. Moreover, the CPU-based implementation of our fast optimization algorithm also accelerates 4.5 times the original optimization algorithm, which indicates that our fast optimization algorithm can save significant computational

Classification Accuracy (%)

485

12

Dimensionality of the low-dimensional space Fig. 3. Average classification accuracy on MIRFlickr-25 K.

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1

13

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx Table 5 The classification accuracy (%) for each category on MIRFlickr-25 K. Method

Animals

Baby

Baby*

Bird

Bird*

Car

Car*

NN-Sparse LSA (GPU)+SVM Original NN-Sparse LSA + SVM Sparse LSA + SVM Traditional LSA + SVM Sparse Coding + SVM Linear SVM

50.43 50.18 42.78 54.04 51.80 52.92

86.04 85.27 36.43 86.04 87.59 37.20

48.27 46.55 46.55 46.55 48.27 43.10

56.60 57.95 49.59 57.68 52.29 48.78

62.80 62.80 59.91 66.11 61.57 62.80

50.34 50.17 39.28 52.50 53.91 47.27

59.47 60.00 47.36 60.52 54.73 55.26

Clouds

Clouds*

Dog

Dog*

Female

Female*

Flower

56.21 56.43 46.43 60.91 55.51 57.62

66.07 65.92 52.29 68.00 63.70 59.11

65.49 64.91 60.23 65.78 62.57 61.98

70.84 71.18 66.77 70.84 68.47 70.16

82.92 83.40 82.98 81.04 82.21 78.33

82.87 82.82 42.49 81.06 82.82 48.56

57.29 57.40 50.16 58.17 56.53 53.67

Flower*

Food

Indoor

Lake

Male

Male*

Night

69.70 69.70 63.94 70.81 70.63 65.98

48.88 48.68 41.21 53.33 55.75 46.86

84.64 84.57 86.35 82.12 83.30 80.94

55.18 54.93 47.59 57.97 54.93 53.67

82.43 82.33 82.99 79.80 80.95 77.73

82.33 82.11 80.25 79.53 81.84 51.01

44.28 44.50 33.50 48.85 48.19 43.09

Night*

People

People*

Plant life

Portrait

Portrait*

River

61.37 61.67 48.50 65.56 60.47 57.18

80.15 80.35 80.75 77.99 82.89 77.84

83.20 83.58 84.32 81.75 83.86 81.03

43.98 44.35 38.73 48.66 48.07 50.55

84.93 85.03 82.84 83.30 84.47 79.84

85.21 84.79 40.80 83.49 84.79 48.06

59.50 59.50 52.12 63.08 57.49 59.06

River

Sea

Sea*

Sky

Structures

Sunset

Transport

58.10 58.10 59.45 59.45 59.45 59.45

60.21 60.06 51.13 62.02 60.06 58.69

92.52 93.45 58.87 75.70 69.15 61.68

53.46 53.41 46.33 59.42 55.78 58.54

51.66 51.88 51.02 56.66 53.46 61.60

58.38 58.20 48.73 61.76 58.66 56.79

47.13 46.78 41.39 52.24 46.37 53.14

Tree

Tree*

Water

Mean

45.02 45.45 37.03 50.79 47.50 48.09

55.98 55.98 44.31 58.08 52.39 49.70

55.37 55.43 46.66 60.12 55.85 55.79

64.19 64.21 54.53 65.31 63.64 58.24

NN-Sparse LSA (GPU)+SVM Original NN-Sparse LSA + SVM Sparse LSA + SVM Traditional LSA + SVM Sparse Coding + SVM Linear SVM

NN-Sparse LSA (GPU)+SVM Original NN-Sparse LSA + SVM Sparse LSA + SVM Traditional LSA + SVM Sparse Coding + SVM Linear SVM

NN-Sparse LSA (GPU)+SVM Original NN-Sparse LSA + SVM Sparse LSA + SVM Traditional LSA + SVM Sparse Coding + SVM Linear SVM

NN-Sparse LSA (GPU)+SVM Original NN-Sparse LSA + SVM Sparse LSA + SVM Traditional LSA + SVM Sparse Coding + SVM Linear SVM

NN-Sparse LSA (GPU)+SVM Original NN-Sparse LSA + SVM Sparse LSA + SVM Traditional LSA + SVM Sparse Coding + SVM Linear SVM

Table 6 The density of the projection matrix (%) for each method on MIRFlickr. Method

D = 100

D = 300

D = 500

D = 700

GPU-accelerated NN-Sparse LSA Original NN-Sparse LSA Sparse LSA Traditional LSA Sparse Coding

0.33 0.33 1.42 100 93

0.18 0.18 0.42 100 95.82

0.14 0.14 0.24 100 96.68

0.12 0.12 0.16 100 95.53

Table 7 The average projection time for each method on MIRFlickr. Method

Projection time (s)

GPU-accelerated NN-Sparse LSA Sparse LSA Traditional LSA Sparse Coding

0.1918 0.1595 1.1978 350.95

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1

14

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

Table 8 The average time (second) per iteration for each implementation on MIRFlickr.

526 527 528 529 530 531 532

533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548

Method

D = 100

D = 300

D = 500

D = 700

NN-Sparse LSA (GPU, optimized) NN-Sparse LSA (CPU, optimized) NN-Sparse LSA (CPU, original)

1.57 6.03 20.71

2.93 17.64 61.97

3.87 29.55 120.63

7.49 57.89 363.95

cost. We also note that the speedup increases as the dimensionality increases. When the dimensionality is 700, our GPUaccelerated approach can achieve a 48x speedup against the original optimization algorithm on CPU. Table 9 also shows the total processing time for the different tested dimensions. When the dimensionality D is small, traditional LSA has the shortest execution time. This is to be expected since the algorithm to solve LSA is not an iterative method. Nevertheless, as the dimensionality increases, our GPU-accelerated NN-Sparse LSA model shows a comparable and even shorter execution time than the LSA model. In practice, it is very important to obtain a much sparser projection matrix within the acceptable time. 6.2.5. Results on NUS-WIDE-270 K In this section, we introduce the experimental results on the NUS-WIDE-270 K dataset. Note that NUS-WIDE-270 K has a much larger number of images than MIRFlickr-25 K. Hence, some dimension reduction methods (i.e., Sparse LSA, Sparse Coding) cannot handle such a large scale dataset within an acceptable time frame. For this reason, we compare our GPU-accelerated NN-Sparse LSA method only with the traditional LSA method. Fig. 4 illustrates the average classification accuracy for varying dimensionality of the latent space. Both our approach and LSA model can perform better than the linear SVM model. Our approach and traditional LSA are comparable in terms of performance. When the dimensionality is small, our GPU-accelerated NN-Sparse LSA model is more accurate than the LSA model. In the contrary, for higher dimensionalities, LSA model exhibits slightly better performance than our approach. These results indicate that our GPU-accelerated NN-Sparse LSA algorithm performs well with a much sparser projection matrix, while saving a lot of computational and storage cost when dealing with large-scale datasets. Table 10 reports the average time per iteration for different implementations of NN-Sparse LSA model. Obviously, our GPU-accelerated NN-Sparse LSA model offers a significant speedup compared to CPU-based implementations of NN-Sparse LSA model. Our parallel algorithm on GPU is 63 times faster, on average, than the CPU-based implementation of the original optimization algorithm. Moreover, our fast optimization algorithm on CPU shows a 7.6 speedup on average against the original optimization algorithm on CPU. When the dimensionality is 1000, our GPU-accelerated approach can perform about

Table 9 The total execution time (second) for different methods on MIRFlickr. Method

D = 100

D = 300

D = 500

D = 700

GPU-accelerated NN-Sparse LSA Sparse LSA Traditional LSA Sparse Coding

42.39 753.84 8.79 798.2

115.29 2857.65 72.74 1549.4

201.24 6446.88 255.86 7089.6

292.15 10226.5 386.60 9481.6

Fig. 4. Average classification accuracy on NUS-WIDE-270 K.

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1

15

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx Table 10 Average time (second) per iteration for each implementation on NUS-WIDE. Method

D = 300

D = 500

D = 700

D = 1000

NN-Sparse LSA (GPU, optimized) NN-Sparse LSA (CPU, optimized) NN-Sparse LSA (CPU, original)

114.64 654.77 2231.05

144.88 1053.63 3935.11

182.39 1500.06 16763.01

243.52 2240.84 27188.68

550

112 times faster than the original optimization algorithm on CPU. Hence, our GPU-accelerated algorithm is very efficient and suitable for handling the large-scale social tagging data.

551

7. Conclusions and future work

552

560

In order to utilize social tagging data in tag recommendation and image classification tasks, we propose a GPU-accelerated fast optimization algorithm for Non-negative Sparse LSA that has a much lower computational complexity for calculating the projection matrix and is more suitable for parallelization. Moreover, we parallelize our approach using NVIDIA CUDA framework and GPU card. The experimental results on MIRFLICKR and NUS-WIDE datasets show that our parallel approach outperforms the other tested approaches in terms of time performance and recommendation or classification accuracy. In the future, we will investigate how to incorporate the visual information of images into the social tag recommendation and image classification tasks. We intend to employ Sparse Canonical Correlation Analysis to address the combination of the visual information and tags as a multi-view learning problem and develop a scalable optimization algorithm with a fast rate of convergence and parallelize it on GPU cards.

561

Acknowledgments

562

565

We would like to thank the anonymous reviewers for their insightful comments on this paper. This work was supported in part by the 973 Program (No. 2012CB316400), the Special Funds for Key Program of National Science and Technology (No. 2010ZX01042-002-003), the Chinese Knowledge Center for Engineering Sciences and Technology Project, the Program for Key Cultural Innovative Research Team of Zhejiang Province, the Fundamental Research Funds for the Central Universities.

566

References

549

553 554 555 556 557 558 559

563 564

567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602

[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22]

M. Andrecut, Parallel GPU Implementation of Iterative PCA Algorithms, CoRR abs/0811.1081, 2008. D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022. G. Casalino, N.D. Buono, C. Mencar, Subtractive clustering for seeding non-negative matrix factorizations, Inf. Sci. 257 (2014) 369–387. J.M. Cavanagh, T.E. Potok, X. Cui, Parallel latent semantic analysis using a graphics processing unit, in: Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference, GECCO’09, pp. 2505–2510. C.C. Chang, C.J. Lin, Libsvm: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (2011) (Article No. 27). L. Chen, D. Xu, I.W. Tsang, J. Luo, Tag-based image retrieval improved by augmented features and group-based refinement, IEEE Trans. Multimedia 14 (2012) 1057–1067. X. Chen, Y. Qi, B. Bai, Q. Lin, J.G. Carbonell, Sparse latent semantic analysis, in: Proceedings of the 11th SIAM International Conference on Data Mining, SDM’11, pp. 474–485. T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y.T. Zheng, Nus-wide: a real-world web image database from national university of singapore, in: Proceedings of the 8th ACM International Conference on Image and Video Retrieval, CIVR’09 (Article No. 48). P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-N recommendation tasks, in: Proceedings of the 4th ACM Conference on Recommender Systems, RecSys’10, pp. 39–46. S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. 41 (1990) 391–407. M. Deshpande, G. Karypis, Item-based top-N recommendation algorithms, ACM Trans. Inf. Syst. 22 (2004) 143–177. S.A. Golder, B.A. Huberman, Usage patterns of collaborative tagging systems, J. Inf. Sci. 32 (2006) 198–208. H. Guan, J. Zhou, B. Xiao, M. Guo, T. Yang, Fast dimension reduction for document classification based on imprecise spectrum analysis, Inf. Sci. 222 (2013) 147–162. M.J. Huiskes, M.S. Lew, The mir flickr retrieval evaluation, in: Proceedings of the 1st ACM SIGMM International Conference on Multimedia Information Retrieval, MIR’08, pp. 39–43. J. Humphrey, D. Price, K. Spagnoli, A. Paolini, E. Kelmelis, Cula: hybrid GPU accelerated linear algebra routines, in: Proceedings of SPIE7705, Modeling and Simulation for Defense Systems and Applications V, (Article No. 770502). W. Kahan, Pracniques: further remarks on reducing truncation errors, Commun. ACM 8 (1965) 40. R. Krestel, P. Fankhauser, W. Nejdl, Latent Dirichlet allocation for tag recommendation, in: Proceedings of the 3rd ACM Conference on Recommender systems, RecSys’09, pp. 61–68. H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, in: Proceedings of the 20th Annual Conference on Neural Information Processing Systems, NIPS’06, pp. 801–808. G. Li, M. Wang, Z. Lu, R. Hong, T.S. Chua, In video product annotation with web information mining, ACM Trans. Multimedia Comput. Commun. Appl. 8 (2012) 55. G. Li, M. Wang, Y.T. Zheng, H. Li, Z.J. Zha, T.S. Chua, Shottagger: tag location for internet videos, in: Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ICMR’11 (Article No. 37). J. Lin, L.Y. Duan, J. Yuan, Q. Li, S. Luo, Learning sparse tag patterns for social image classification, in: Proceedings of the 19th IEEE International Conference on Image Processing, ICIP’12, pp. 2881–2884. Z. Liu, Y. Zhang, E.Y. Chang, M. Sun, Plda+: Parallel latent Dirichlet allocation with data placement and pipeline processing, ACM Trans. Intell. Syst. Technol. 2 (2011) (Article No. 26).

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047

INS 10858

No. of Pages 16, Model 3G

12 May 2014 Q1 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632

16

Y. Zhang et al. / Information Sciences xxx (2014) xxx–xxx

[23] N. Lopes, B. Ribeiro, Non-negative matrix factorization implementation using graphic processing units, in: Proceedings of 11th International Conference on Intelligent Data Engineering and Automated Learning, IDEAL’10, pp. 275–283. [24] L.B. Marinho, A. Hotho, R. Jäschke, A. Nanopoulos, S. Rendle, L. Schmidt-Thieme, G. Stumme, P. Symeonidis, Recommender Systems for Social Tagging Systems, Springer Briefs in Electrical and Computer Engineering, Springer, 2012. [25] L. Nie, M. Wang, Z.J. Zha, T.S. Chua, Oracle in image search: a content-based approach to performance prediction, ACM Trans. Inf. Syst. 30 (2012) 13. [26] N. Nori, D. Bollegala, H. Kashima, Multinomial relation prediction in social data: a dimension reduction approach, in: Proceedings of the 26th AAAI Conference on Artificial Intelligence, AAAI’12, pp. 115–121. [27] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, T. Purcell, A survey of general-purpose computation on graphics hardware, Comput. Graph. Forum 26 (2007) 90–113. [28] H. Shan, A. Banerjee, Mixed-membership naive bayes models, Data Min. Knowl. Discov. 23 (2011) 1–62. [29] B. Sigurbjörnsson, R. van Zwol, Flickr tag recommendation based on collective knowledge, in: Proceedings of the 17th International Conference on World Wide Web, WWW’08, pp. 327–336. [30] A.J. Smola, S. Narayanamurthy, An architecture for parallel topic models, PVLDB 3 (2010) 703–710. [31] P. Symeonidis, A. Nanopoulos, Y. Manolopoulos, Tag recommendations based on tensor dimensionality reduction, in: Proceedings of the 2nd ACM Conference on Recommender Systems, RecSys’08, pp. 43–50. [32] P. Symeonidis, A. Nanopoulos, Y. Manolopoulos, A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis, IEEE Trans. Knowl. Data Eng. 22 (2010) 179–192. [33] M. Wang, H. Li, D. Tao, K. Lu, X. Wu, Multimodal graph-based reranking for web image search, IEEE Trans. Image Process. 21 (2012) 4649–4661. [34] M. Wang, B. Ni, X.S. Hua, T.S. Chua, Assistive tagging: a survey of multimedia tagging with human–computer joint exploration, ACM Comput. Surv. 44 (2012) (Article No. 25). [35] M. Wang, K. Yang, X.S. Hua, H.J. Zhang, Towards a relevant and diverse search of social images, IEEE Trans. Multimedia 12 (2010) 829–842. [36] Q. Wang, L. Ruan, Z. Zhang, L. Si, Learning compact hashing codes for efficient tag completion and prediction, in: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, pp. 1789–1794. [37] Q. Wang, J. Xu, H. Li, N. Craswell, Regularized latent semantic indexing, in: Proceedings of the 34th Annual International ACM SIGIR Conference, SIGIR’11, pp. 685–694. [38] L. Wu, R. Jin, A.K. Jain, Tag completion for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013) 716–727. [39] F. Yan, N. Xu, Y. Qi, Parallel inference for latent Dirichlet allocation on graphics processing units, in: Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, NIPS’09, pp. 2134–2142. [40] G. Zhu, S. Yan, Y. Ma, Image tag refinement towards low-rank, content-tag prior and error sparsity, in: Proceedings of the 18th International Conference on Multimedia, MM’10, pp. 461–470.

633

Q1 Please cite this article in press as: Y. Zhang et al., A GPU-accelerated non-negative sparse latent semantic analysis algorithm for social tagging data, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.04.047