Projective nonnegative matrix factorization for social image retrieval


Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Qiuli Liu a, Zechao Li b,*

a Capital University of Economics and Business, No. 121 Zhangjialukou Road, Fengtai District, Beijing 100070, PR China
b School of Computer Science, Nanjing University of Science and Technology, No. 200, Xiaolinwei Road, Nanjing 210094, PR China

Article info

Article history: Received 30 October 2013; Received in revised form 18 September 2014; Accepted 23 September 2014

Keywords: Nonnegative matrix factorization; Latent subspace; Image retrieval

Abstract

Increasingly many social images with tags are available on photo sharing websites. Due to the subjectivity and diversity of social tagging behaviors, noisy and missing tags for images are inevitable. To tackle this problem, this paper proposes a novel factor analysis model, named ProjecTive Nonnegative Matrix Factorization (PTNMF) with $\ell_{2,1}$-norm regularization, which introduces a linear transformation and $\ell_{2,1}$-norm minimization into a joint NMF framework. For tagging data, a new interpretation is adopted to distinguish relevant tags from irrelevant tags instead of the typically used binary scheme. In our model, the image latent representation is assumed to be projected from its original feature representation with an orthogonal transformation matrix. The projection makes it convenient to embed any image, including out-of-sample images, into the latent space. That is, the proposed method can handle the out-of-sample problem. The $\ell_{2,1}$-norm regularization makes the transformation matrix suitable for selecting the effective features. Local geometry preservation of the image space (tag space) is explored as a constraint in order to make image similarity (tag correlation) consistent between the original space and the corresponding latent space. We investigate the performance of the proposed method on image retrieval and compare it to existing work on the challenging NUS-WIDE dataset. Extensive experiments indicate the effectiveness and potential of the proposed method in real-world applications. © 2015 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail address: [email protected] (Z. Li).

http://dx.doi.org/10.1016/j.neucom.2014.09.094
0925-2312/© 2015 Elsevier B.V. All rights reserved.

Please cite this article as: Q. Liu, Z. Li, Projective nonnegative matrix factorization for social image retrieval, Neurocomputing (2015), http://dx.doi.org/10.1016/j.neucom.2014.09.094

1. Introduction

With the permeation of Web 2.0 technologies and digital cameras, there has been an explosion of social media sharing systems available online, such as Flickr, Facebook and Zooomr. Users can share their photos, and tag and comment on the photos of others. Tagging, in general, allows users to describe an image with a list of tags, which can be utilized to search, browse and organize images. However, due to the subjectivity and diversity of amateur tagging, tags are known to be ambiguous, limited in terms of completeness, and overly personalized [1,2]. As a consequence, user-provided tags are often imprecise, biased and incomplete for describing the content of the images. Thus, in recent years, much research attention has been devoted to reliably learning the relevance of a tag with respect to the visual content it describes [3,4], which is an essential issue for image retrieval.

Many efforts have been made to learn the relevance of tags to images. One category of tag relevance learning methods predicts relevant tags for images with no tags [3–11]. One related work is the Multi-correlation Probabilistic Matrix Factorization (MPMF) model [9], which combines the inter- and intra-correlation matrices through shared latent matrices based on Probabilistic Matrix Factorization (PMF) [12]. These methods are mostly trained on manually labeled data and tested on small data sets [13], which makes them unsuitable for social image tagging. In the second scenario, given an image labeled with some tags, tag relevance learning can be used to remove noisy tags, recommend new relevant tags or reduce tag ambiguity [2,14–19]. Tag ranking [15] exploits pairwise similarity between tags by random walk to refine the ranking score. In [16,17], both the image similarity and tag correlation are exploited simultaneously to discover the tag relevance. In [16], the image tags are refined by adding consistency constraints to Robust Principal Component Analysis (RPCA) [20]. However, these methods cannot handle the out-of-sample problem and do not utilize the visual features directly. That is, the large volume of new images cannot be tagged.

To this end, in this paper, we tackle the social image tag relevance task with the Nonnegative Matrix Factorization (NMF) model. Different from the above MF-based methods, we propose a novel NMF algorithm, namely ProjecTive Nonnegative Matrix Factorization (PTNMF) with $\ell_{2,1}$-norm regularization, to collaboratively predict the tag relevance. To handle missing tags and noisy tags, a new interpretation scheme is proposed, motivated by [21]. To handle the out-of-sample problem, we assume that the latent image representation is an explicit projection of the original image representation via an orthogonal transformation matrix. The $\ell_{2,1}$-norm regularization is introduced to learn a reliable transformation matrix [22]. To preserve the local geometrical properties in both the visual space and the semantic space, the visual similarity and tag relevance are explored jointly and simultaneously. We also propose an efficient iterative algorithm to optimize the problem. Experiments on different social image datasets demonstrate that our algorithm outperforms the state-of-the-art algorithms.

The main contributions of this paper are summarized as follows:

1. We propose a novel ProjecTive Nonnegative Matrix Factorization (PTNMF) algorithm, which can handle the out-of-sample problem under the assumption that the image latent representation is projected from its original feature representation through an orthogonal transformation matrix.
2. To handle irrelevant visual features, an $\ell_{2,1}$-norm regularization is introduced to learn a reliable transformation matrix that is suitable for selecting the effective features.
3. To keep the local geometrical properties in both the visual and semantic spaces, the visual similarity and tag relevance are explored jointly and simultaneously.

The remainder of this paper is organized as follows. We review related work in Section 2. Section 3 elaborates the proposed PTNMF algorithm. In Section 4, extensive experiments are conducted to evaluate the performance of the proposed method and compare it to other related methods. The conclusion of this paper, with a discussion of future work, is presented in Section 5.

2. Related work

Estimating the relevance of tags with respect to images is an essential issue in text-based image retrieval. The related techniques fall into two main scenarios, namely tag annotation for untagged images and tag refinement for tagged images.

Methods in the first category predict relevant tags for images with no tags. A variety of methods have been proposed to annotate images automatically [5,7,9,23–30], which can be categorized into two main types: generative models and classification models. The generative models try to estimate the probabilistic relationship between tags and images. By assigning relevance scores of tags to images, the annotated results can be utilized to help the task of image retrieval.

In the second scenario, given an image labeled with some tags, tag relevance learning can be used to remove noisy tags, recommend new relevant tags or reduce tag ambiguity. Many approaches have been proposed to tackle the tag relevance learning problem [2,14–17,31–39]. The Random Walk with Restarts (RWR) algorithm [31] is proposed to leverage the co-occurrence-based tag similarity and the information of the original annotated order of tags. The tag refinement problem is formulated as a Markov process and the candidate tags are treated as the states in [14]. In [2], a neighbor voting algorithm is proposed to estimate a tag's relevance by exploiting tagging redundancies among multiple users. The tag relevance is determined based on the number of such votes from the nearest neighbors. Tag ranking [15] further exploits pairwise similarity between tags by random walk to refine the ranking score. In [9,16,17], both the image similarity and tag correlation are exploited simultaneously to discover the tag relevance. A Multi-correlation Probabilistic Matrix Factorization (MPMF) model [9] is proposed to combine the inter- and intra-correlation matrices by the shared latent matrices. In [16], the image labels are refined by decomposing the observed label matrix into a low-rank refined matrix and a sparse error matrix. A two-view learning approach is proposed to address the tag ranking problem in [17]. A shared subspace learning framework based on NMF is proposed to leverage a secondary source to improve retrieval performance from a primary dataset in [40]. In [39], a unified subspace is learned for images and tags, which can refine the tags of images by nearest tag search in the underlying subspace.

However, the above methods focus on refining image tags. They cannot assign tags to new images. That is, they cannot address the out-of-sample problem. Different from the previous work, this paper presents a novel projective nonnegative matrix factorization approach to estimate the relevance of tags to social images. The local visual geometry in the image space and the local textual geometry in the tag space are exploited simultaneously. This method can be employed both to tag new images and to refine the tags of existing images.

3. The proposed PTNMF algorithm

3.1. Nonnegative matrix factorization

Before getting started, we first summarize some notations. Throughout this paper, we use bold uppercase characters to denote matrices and bold lowercase characters to denote vectors. For an arbitrary matrix $\mathbf{M}$, $\mathbf{m}_i$ denotes the $i$th column vector of $\mathbf{M}$, $M_{ij}$ denotes the $(i,j)$th entry of $\mathbf{M}$ and $\mathrm{Tr}[\mathbf{M}]$ is the trace of $\mathbf{M}$ if $\mathbf{M}$ is square. $\mathbf{M}^T$ denotes the transpose of $\mathbf{M}$. For any $\mathbf{M} \in \mathbb{R}^{r \times t}$, its $\ell_{2,1}$-norm is defined as

$$\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{r} \sqrt{\sum_{j=1}^{t} M_{ij}^2} = \mathrm{Tr}[\mathbf{M}^T \mathbf{D} \mathbf{M}]. \qquad (1)$$

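Here $\mathbf{D}$ is a diagonal matrix with $D_{ii} = 1/\|\mathbf{m}^i\|_2$, where $\mathbf{m}^i$ denotes the $i$th row of $\mathbf{M}$.

Assume that we have $n$ tagged images and $m$ tags. Let $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$ denote the data sample set, in which $\mathbf{x}_i \in \mathbb{R}^d$ denotes the feature descriptor of the $i$th sample. $\mathbf{Y} \in \mathbb{R}^{m \times n}$ denotes the tag-image association matrix, such that $Y_{ij} = 1$ if image $\mathbf{x}_j$ is tagged with the $i$th tag, and 0 otherwise. The task of NMF is to derive two nonnegative factor matrices $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m] \in \mathbb{R}_+^{p \times m}$ and $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n] \in \mathbb{R}_+^{p \times n}$, where $p < \min(m, n)$. That is, given $\mathbf{Y} \in \mathbb{R}^{m \times n}$, a low-rank NMF approach seeks to approximate it by the product of two rank-$p$ nonnegative factors:

$$\min_{\mathbf{U},\mathbf{V}} \|\mathbf{Y} - \mathbf{U}^T \mathbf{V}\|_F^2 \quad \text{s.t.} \quad \mathbf{U}, \mathbf{V} \ge 0, \qquad (2)$$

where $\|\cdot\|_F$ denotes the Frobenius norm. To optimize the objective, an iterative multiplicative updating algorithm was proposed in [41] as follows:

$$V_{mj} \leftarrow V_{mj} \frac{(\mathbf{U}\mathbf{Y})_{mj}}{(\mathbf{U}\mathbf{U}^T\mathbf{V})_{mj}} \qquad (3)$$

$$U_{mi} \leftarrow U_{mi} \frac{(\mathbf{V}\mathbf{Y}^T)_{mi}}{(\mathbf{V}\mathbf{V}^T\mathbf{U})_{mi}} \qquad (4)$$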
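The multiplicative updates of Eqs. (3) and (4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `nmf_multiplicative`, the random initialization, the fixed iteration count and the small constant `eps` that guards against division by zero are all assumptions made for this sketch.

```python
import numpy as np

def nmf_multiplicative(Y, p, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative Y (m x n) by U.T @ V, with U (p x m) and V (p x n).

    Hypothetical helper: initialization, n_iter and eps are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    U = rng.random((p, m)) + eps
    V = rng.random((p, n)) + eps
    for _ in range(n_iter):
        # Eq. (3): V <- V * (U Y) / (U U^T V), element-wise
        V *= (U @ Y) / (U @ U.T @ V + eps)
        # Eq. (4): U <- U * (V Y^T) / (V V^T U), element-wise
        U *= (V @ Y.T) / (V @ V.T @ U + eps)
    return U, V
```

Because every factor in the update ratio is nonnegative, `U` and `V` stay nonnegative without any explicit projection step, which is the main appeal of this family of update rules.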

3.2. The objective function

Since the original tagging data contains noisy tags and missing tags, it is unreasonable to treat all assigned tags equally and to encode all unobserved data as 0, as the binary interpretation does [9,16,17]. To address these problems, a novel interpretation scheme is presented based on context and semantic information. It is reasonable to assume that tags that are highly relevant in terms of context and semantic information are likely to appear in the same images. On the other hand, a tag that is irrelevant to the other tags in the same image may well be noisy. Thus, we distinguish the relevance of tags based on co-occurrence (context information) and WordNet (semantic information) and build a tag affinity matrix $\mathbf{C} \in \mathbb{R}^{m \times m}$ (its definition is given later in this section). The original tagging data $\mathbf{Y}$ is refined as follows:

$$\hat{Y}_{ij} = \frac{1}{|T_j|} \sum_{k \in T_j} C_{ik}, \qquad (5)$$

where $T_j = \{k : Y_{kj} > 0\}$.

To deal with the out-of-sample problem, we assume that the image latent space is the intrinsic space of the high-dimensional visual space, obtained by an explicit projection

$$\mathbf{V} = \mathbf{A}^T \mathbf{X}, \qquad (6)$$

where $\mathbf{A} \in \mathbb{R}^{d \times p}$ is the projection matrix that transforms a $d$-dimensional feature vector into the $p$-dimensional latent space. To guarantee the nonnegativity of $\mathbf{V}$, a nonnegative constraint is imposed on $\mathbf{A}$ and the visual features $\mathbf{X}$ are preprocessed to be nonnegative in our experiments. Thus, the objective of our proposed method is to learn the nonnegative projection matrix $\mathbf{A}$ and the tag latent feature matrix $\mathbf{U}$. To reduce redundancy, we impose orthogonality constraints on the projection directions, which, coupled with a unit-norm assumption, lead to the constraint $\mathbf{A}^T\mathbf{A} = \mathbf{I}$. The orthogonality constraint also avoids arbitrary scaling and the trivial all-zero solution [22]. Therefore, based on the above two issues, the objective function is formulated as

$$\min_{\mathbf{U},\mathbf{A}} \frac{1}{2}\|\hat{\mathbf{Y}} - \mathbf{U}^T\mathbf{A}^T\mathbf{X}\|_F^2 + \frac{\alpha}{2}\|\mathbf{U}\|_F^2 + \frac{\beta}{2}\|\mathbf{A}\|_{2,1} \quad \text{s.t.} \quad \mathbf{A}^T\mathbf{A} = \mathbf{I},\ \mathbf{A} \ge 0,\ \mathbf{U} \ge 0. \qquad (7)$$

Here $\alpha$ and $\beta$ are two nonnegative trade-off parameters. To avoid overfitting, two regularization terms, i.e., $\|\mathbf{U}\|_F^2$ and $\|\mathbf{A}\|_{2,1}$, are added to the optimization problem (7). Moreover, the regularization term $\|\mathbf{A}\|_{2,1}$ ensures that $\mathbf{A}$ is sparse in rows, so that an effective transformation matrix is learned [22]. Once the transformation matrix $\mathbf{A}$ is learned, new images can be directly mapped into the learned model and tagged with related tags as follows:

$$\mathbf{v}_o = \mathbf{A}^T \mathbf{x}_o \qquad (8)$$

$$Y_{io} = \mathbf{u}_i^T \mathbf{v}_o, \quad i = 1, \ldots, m \qquad (9)$$

The new image $\mathbf{x}_o$ is tagged with the tags that have high values of $Y_{io}$.

To discover the latent space, we also hope that the space respects the intrinsic Riemannian structure rather than the ambient Euclidean structure, which coincides with the so-called local invariance assumption [42,43]. That is, if two data points $\mathbf{x}_j$ and $\mathbf{x}_k$ are close in the intrinsic geometry of the data distribution, the latent representations $\mathbf{v}_j$ and $\mathbf{v}_k$ are also close to each other. The local geometric structure can be effectively modeled by a nearest neighbor graph on a scatter of data points [42]. Therefore, to preserve the local geometrical structures of the visual space and the tag space, it is necessary to construct the visual affinity graph $\mathbf{S}$ and the tag affinity graph $\mathbf{C}$ (both defined below). Let us take $\mathbf{S}$ to sketch the details. $S_{ij}$ represents the visual similarity of images $\mathbf{x}_i$ and $\mathbf{x}_j$. We can employ the following term to preserve the visual local geometric structure:

$$\Omega(\mathbf{V}) = \frac{1}{2}\sum_{i,j=1}^{n} S_{ij} \left\| \frac{\mathbf{v}_i}{\sqrt{D^S_{ii}}} - \frac{\mathbf{v}_j}{\sqrt{D^S_{jj}}} \right\|_2^2 = \mathrm{Tr}[\mathbf{V}\mathbf{L}_S\mathbf{V}^T]. \qquad (10)$$

Here $\mathbf{D}^S$ is a diagonal matrix with $D^S_{ii} = \sum_{j=1}^{n} S_{ij}$, used for normalization purposes, and $\mathbf{L}_S = (\mathbf{D}^S)^{-1/2}(\mathbf{D}^S - \mathbf{S})(\mathbf{D}^S)^{-1/2}$ is the normalized graph Laplacian matrix. The image similarity matrix $\mathbf{S}$ is defined as

$$S_{ij} = \begin{cases} \exp\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{\sigma^2}\right) & \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are } k \text{ nearest neighbors;} \\ 0 & \text{otherwise,} \end{cases} \qquad (11)$$

where $\sigma$ is the bandwidth parameter. $\mathbf{S}$ can also be defined using sparse coding [44]. Similarly, to preserve the tag local geometric structure, we have

$$\Omega(\mathbf{U}) = \mathrm{Tr}[\mathbf{U}\mathbf{L}_C\mathbf{U}^T]. \qquad (12)$$

For the tag affinity matrix $\mathbf{C}$, the context relevance is measured by the co-occurrence in the dataset:

$$C^c_{ij} = \frac{\mathrm{corr}(i,j)}{\mathrm{corr}(i,i) + \mathrm{corr}(j,j) - \mathrm{corr}(i,j)}, \qquad (13)$$

where $\mathrm{corr}(i,j)$ is the number of images in which the $i$th and $j$th tags co-occur. The semantic relevance is estimated based on WordNet as in [45]:

$$C^s_{ij} = \frac{2\,\mathrm{IC}(\mathrm{lcs}(i,j))}{\mathrm{IC}(i) + \mathrm{IC}(j)}, \qquad (14)$$

where $\mathrm{IC}(i)$ is the information content of the $i$th tag and $\mathrm{lcs}(i,j)$ is the least common subsumer of the $i$th and $j$th tags in the WordNet taxonomy. The tag affinity graph is constructed by

$$C_{ij} = \mu C^c_{ij} + (1 - \mu) C^s_{ij}, \qquad (15)$$

where $\mu$ controls the trade-off between the context relevance and the semantic relevance.

To preserve the local geometric structures, we combine the two terms of Eqs. (10) and (12), i.e., we minimize the following optimization problem:

$$\min_{\mathbf{U},\mathbf{V}} \frac{\gamma}{2}\Omega(\mathbf{V}) + \frac{\lambda}{2}\Omega(\mathbf{U}), \qquad (16)$$

where $\gamma$ and $\lambda$ are two nonnegative parameters that balance the trade-off between visual information and tag information.

To learn the transformation matrix and the tag latent feature matrix effectively, we jointly exploit the optimization problems (7) and (16). Therefore, the objective function of the proposed PTNMF method is formulated as

$$\min_{\mathbf{A},\mathbf{U}} \frac{1}{2}\|\hat{\mathbf{Y}} - \mathbf{U}^T\mathbf{A}^T\mathbf{X}\|_F^2 + \frac{\gamma}{2}\mathrm{Tr}[\mathbf{A}^T\mathbf{X}\mathbf{L}_S\mathbf{X}^T\mathbf{A}] + \frac{\lambda}{2}\mathrm{Tr}[\mathbf{U}\mathbf{L}_C\mathbf{U}^T] + \frac{\alpha}{2}\|\mathbf{U}\|_F^2 + \frac{\beta}{2}\|\mathbf{A}\|_{2,1} \quad \text{s.t.} \quad \mathbf{A}^T\mathbf{A} = \mathbf{I},\ \mathbf{A} \ge 0,\ \mathbf{U} \ge 0. \qquad (17)$$
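The data-dependent inputs of the objective, namely the co-occurrence affinity of Eq. (13), the refined tagging data of Eq. (5), the k-nearest-neighbor affinity of Eq. (11) and the normalized Laplacian used in Eq. (10), could be assembled roughly as follows. This is a hedged sketch: all function names are hypothetical, the WordNet term of Eq. (14) is omitted, and the way the k-nearest-neighbor graph is symmetrized is an assumption, since the text does not specify it.

```python
import numpy as np

def cooccurrence_affinity(Y):
    """Eq. (13): context relevance from a binary tag-image matrix Y (m x n)."""
    Y = (np.asarray(Y) > 0).astype(float)
    corr = Y @ Y.T                       # corr[i, j]: images holding both tags
    occ = np.diag(corr)                  # corr[i, i]: images holding tag i
    denom = occ[:, None] + occ[None, :] - corr
    return np.divide(corr, denom, out=np.zeros_like(corr), where=denom > 0)

def refine_tags(Y, C):
    """Eq. (5): Y_hat[i, j] is the mean of C[i, k] over the tags k of image j."""
    mask = (np.asarray(Y) > 0).astype(float)
    counts = mask.sum(axis=0)            # |T_j| for every image j
    return (C @ mask) / np.maximum(counts, 1)   # untagged images stay all-zero

def knn_gaussian_affinity(X, k, sigma):
    """Eq. (11): Gaussian weights between k-nearest-neighbor columns of X (d x n)."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X.T @ X, 0.0)
    np.fill_diagonal(dist2, np.inf)      # an image is not its own neighbor
    idx = np.argsort(dist2, axis=1)[:, :k].ravel()
    rows = np.repeat(np.arange(n), k)
    S = np.zeros((n, n))
    S[rows, idx] = np.exp(-dist2[rows, idx] / sigma ** 2)
    return np.maximum(S, S.T)            # symmetrization: an assumption of this sketch

def normalized_laplacian(S):
    """L = D^{-1/2} (D - S) D^{-1/2} with D_ii = sum_j S_ij."""
    d = S.sum(axis=1)
    inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    return inv_sqrt[:, None] * (np.diag(d) - S) * inv_sqrt[None, :]
```

With a symmetric `S`, `np.trace(V @ normalized_laplacian(S) @ V.T)` reproduces the pairwise sum on the left-hand side of Eq. (10), which is a convenient sanity check when experimenting with other affinity definitions.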

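Algorithm 1. The proposed PTNMF algorithm.

Input: Image feature matrix $\mathbf{X}$; tag assignment matrix $\hat{\mathbf{Y}}$; image similarity matrix $\mathbf{S}$; tag correlation matrix $\mathbf{C}$; parameters $\alpha$, $\beta$, $\gamma$, $\lambda$ and $\eta$; the rank $p$.

1: Calculate $\mathbf{L}_S$ and $\mathbf{L}_C$ based on $\mathbf{S}$ and $\mathbf{C}$;
2: The iteration step $t = 1$;
3: Initialize $\mathbf{U}^t \in \mathbb{R}_+^{p \times m}$;
4: Set $\mathbf{A}^t \in \mathbb{R}^{d \times p}$ and $\mathbf{D}^t \in \mathbb{R}^{d \times d}$ as identity matrices;
5: repeat
6: Update $\mathbf{U}$:
$$U^{t+1}_{mi} \leftarrow U^{t}_{mi} \frac{\left((\mathbf{A}^t)^T\mathbf{X}\hat{\mathbf{Y}}^T + \lambda\mathbf{U}^t\mathbf{L}_C^-\right)_{mi}}{\left((\mathbf{A}^t)^T\mathbf{X}\mathbf{X}^T\mathbf{A}^t\mathbf{U}^t + \lambda\mathbf{U}^t\mathbf{L}_C^+ + \alpha\mathbf{U}^t\right)_{mi}};$$
7: Update $\mathbf{A}$:
$$A^{t+1}_{jm} \leftarrow A^{t}_{jm} \frac{\left(\mathbf{X}\hat{\mathbf{Y}}^T(\mathbf{U}^t)^T + \gamma\mathbf{X}\mathbf{L}_S^-\mathbf{X}^T\mathbf{A}^t + \eta\mathbf{A}^t\right)_{jm}}{\left(\mathbf{X}\mathbf{X}^T\mathbf{A}^t\mathbf{U}^t(\mathbf{U}^t)^T + \gamma\mathbf{X}\mathbf{L}_S^+\mathbf{X}^T\mathbf{A}^t + \eta\mathbf{A}^t(\mathbf{A}^t)^T\mathbf{A}^t + \beta\mathbf{D}^t\mathbf{A}^t\right)_{jm}};$$
8: Update $\mathbf{D}$:
$$\mathbf{D}^{t+1} = \begin{bmatrix} \frac{1}{\|(\mathbf{a}^{t+1})^1\|_2} & & \\ & \ddots & \\ & & \frac{1}{\|(\mathbf{a}^{t+1})^d\|_2} \end{bmatrix};$$
9: $t = t + 1$;
10: until Convergence criterion satisfied

Output: Projection matrix $\mathbf{A}$ and latent tag feature matrix $\mathbf{U}$.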
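The steps of Algorithm 1, together with the out-of-sample tagging of Eqs. (8) and (9), can be sketched in NumPy as follows. This is an illustrative reading of the update rules, not the authors' MATLAB code. The names `ptnmf`, `split_pos_neg` and `tag_new_image` are hypothetical, a fixed iteration count stands in for the unspecified convergence criterion, and the default of `beta` is arbitrary because the paper tunes it by grid search. The sketch also adds a small positive offset to the identity initialization of `A`, because a multiplicative update can never move an entry away from exactly zero; that offset is an assumption of this sketch.

```python
import numpy as np

def split_pos_neg(L):
    """Return (L+, L-) with L = L+ - L-, L+ = (|L| + L)/2 and L- = (|L| - L)/2."""
    return (np.abs(L) + L) / 2, (np.abs(L) - L) / 2

def ptnmf(X, Y_hat, L_S, L_C, p, alpha=0.005, beta=1.0, gamma=0.001,
          lam=0.01, eta=1e8, n_iter=100, eps=1e-12, seed=0):
    """Sketch of Algorithm 1. X: d x n (nonnegative), Y_hat: m x n, L_S: n x n, L_C: m x m."""
    d, _ = X.shape
    m = Y_hat.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((p, m))                       # step 3
    A = np.eye(d, p) + 1e-3                      # step 4, with the assumed offset
    D = np.eye(d)
    LS_pos, LS_neg = split_pos_neg(L_S)
    LC_pos, LC_neg = split_pos_neg(L_C)
    XXt, XYt = X @ X.T, X @ Y_hat.T
    for _ in range(n_iter):                      # stand-in for the convergence test
        # Step 6: multiplicative update of U
        num_U = A.T @ XYt + lam * U @ LC_neg
        den_U = A.T @ XXt @ A @ U + lam * U @ LC_pos + alpha * U
        U_next = U * num_U / (den_U + eps)
        # Step 7: multiplicative update of A (uses U^t, as written in Algorithm 1)
        num_A = XYt @ U.T + gamma * X @ LS_neg @ X.T @ A + eta * A
        den_A = (XXt @ A @ U @ U.T + gamma * X @ LS_pos @ X.T @ A
                 + eta * A @ A.T @ A + beta * D @ A)
        A = A * num_A / (den_A + eps)
        U = U_next
        # Step 8: D_ii = 1 / ||i-th row of A||_2
        D = np.diag(1.0 / np.maximum(np.linalg.norm(A, axis=1), eps))
    return A, U

def tag_new_image(A, U, x_new):
    """Eqs. (8)-(9): relevance of every tag to an unseen image with features x_new."""
    return U.T @ (A.T @ x_new)
```

Sorting the output of `tag_new_image` in descending order gives the tag ranking for the new image; no neighbor search over the learning data is needed, which is the practical benefit of learning an explicit projection.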

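3.3. Optimization

In this subsection, we present an algorithm to solve the optimization problem in (17). First, the objective function of PTNMF is rewritten as follows:

$$\min_{\mathbf{A},\mathbf{U}} \frac{1}{2}\|\hat{\mathbf{Y}} - \mathbf{U}^T\mathbf{A}^T\mathbf{X}\|_F^2 + \frac{\gamma}{2}\mathrm{Tr}[\mathbf{A}^T\mathbf{X}\mathbf{L}_S\mathbf{X}^T\mathbf{A}] + \frac{\lambda}{2}\mathrm{Tr}[\mathbf{U}\mathbf{L}_C\mathbf{U}^T] + \frac{\eta}{4}\|\mathbf{A}^T\mathbf{A} - \mathbf{I}\|_F^2 + \frac{\alpha}{2}\|\mathbf{U}\|_F^2 + \frac{\beta}{2}\|\mathbf{A}\|_{2,1} \quad \text{s.t.} \quad \mathbf{A} \ge 0,\ \mathbf{U} \ge 0, \qquad (18)$$

where $\eta$ is a positive parameter that controls the orthogonality condition; we fix it to $10^8$ in our experiments to ensure that the orthogonality constraint is satisfied. Since $\mathbf{U}, \mathbf{A} \ge 0$, we introduce the Lagrangian multipliers $\psi_{mi}$ and $\phi_{jm}$ for the constraints $U_{mi} \ge 0$ and $A_{jm} \ge 0$, respectively. Let $\mathbf{\Psi} = (\psi_{mi})$ and $\mathbf{\Phi} = (\phi_{jm})$. The Lagrangian function is

$$\mathcal{L} = \frac{1}{2}\|\hat{\mathbf{Y}} - \mathbf{U}^T\mathbf{A}^T\mathbf{X}\|_F^2 + \frac{\gamma}{2}\mathrm{Tr}[\mathbf{A}^T\mathbf{X}\mathbf{L}_S\mathbf{X}^T\mathbf{A}] + \frac{\lambda}{2}\mathrm{Tr}[\mathbf{U}\mathbf{L}_C\mathbf{U}^T] + \frac{\eta}{4}\|\mathbf{A}^T\mathbf{A} - \mathbf{I}\|_F^2 + \frac{\alpha}{2}\|\mathbf{U}\|_F^2 + \frac{\beta}{2}\|\mathbf{A}\|_{2,1} + \mathrm{Tr}[\mathbf{\Psi}\mathbf{U}^T] + \mathrm{Tr}[\mathbf{\Phi}\mathbf{A}^T]. \qquad (19)$$

The partial derivatives of $\mathcal{L}$ with respect to $\mathbf{U}$ and $\mathbf{A}$ are

$$\frac{\partial \mathcal{L}}{\partial \mathbf{U}} = -\mathbf{A}^T\mathbf{X}(\hat{\mathbf{Y}} - \mathbf{U}^T\mathbf{A}^T\mathbf{X})^T + \lambda\mathbf{U}\mathbf{L}_C + \alpha\mathbf{U} + \mathbf{\Psi} \qquad (20)$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{A}} = -\mathbf{X}(\hat{\mathbf{Y}} - \mathbf{U}^T\mathbf{A}^T\mathbf{X})^T\mathbf{U}^T + \gamma\mathbf{X}\mathbf{L}_S\mathbf{X}^T\mathbf{A} + \eta(\mathbf{A}\mathbf{A}^T\mathbf{A} - \mathbf{A}) + \beta\mathbf{D}\mathbf{A} + \mathbf{\Phi} \qquad (21)$$

Using the Karush–Kuhn–Tucker (KKT) conditions [46] $\psi_{mi}U_{mi} = 0$ and $\phi_{jm}A_{jm} = 0$, we obtain the following equations for $U_{mi}$ and $A_{jm}$:

$$(\mathbf{A}^T\mathbf{X}\mathbf{X}^T\mathbf{A}\mathbf{U} + \lambda\mathbf{U}\mathbf{L}_C + \alpha\mathbf{U} - \mathbf{A}^T\mathbf{X}\hat{\mathbf{Y}}^T)_{mi}\,U_{mi} = 0 \qquad (22)$$

$$(\mathbf{X}\mathbf{X}^T\mathbf{A}\mathbf{U}\mathbf{U}^T + \gamma\mathbf{X}\mathbf{L}_S\mathbf{X}^T\mathbf{A} + \eta\mathbf{A}\mathbf{A}^T\mathbf{A} + \beta\mathbf{D}\mathbf{A} - \mathbf{X}\hat{\mathbf{Y}}^T\mathbf{U}^T - \eta\mathbf{A})_{jm}\,A_{jm} = 0. \qquad (23)$$

Introducing $\mathbf{L} = \mathbf{L}^+ - \mathbf{L}^-$ for each Laplacian matrix ($\mathbf{L}_S$ and $\mathbf{L}_C$), where $\mathbf{L}^+ = (|\mathbf{L}| + \mathbf{L})/2$ and $\mathbf{L}^- = (|\mathbf{L}| - \mathbf{L})/2$, we obtain the following updating rules:

$$U_{mi} \leftarrow U_{mi} \frac{(\mathbf{A}^T\mathbf{X}\hat{\mathbf{Y}}^T + \lambda\mathbf{U}\mathbf{L}_C^-)_{mi}}{(\mathbf{A}^T\mathbf{X}\mathbf{X}^T\mathbf{A}\mathbf{U} + \lambda\mathbf{U}\mathbf{L}_C^+ + \alpha\mathbf{U})_{mi}} \qquad (24)$$

and

$$A_{jm} \leftarrow A_{jm} \frac{(\mathbf{X}\hat{\mathbf{Y}}^T\mathbf{U}^T + \gamma\mathbf{X}\mathbf{L}_S^-\mathbf{X}^T\mathbf{A} + \eta\mathbf{A})_{jm}}{(\mathbf{X}\mathbf{X}^T\mathbf{A}\mathbf{U}\mathbf{U}^T + \gamma\mathbf{X}\mathbf{L}_S^+\mathbf{X}^T\mathbf{A} + \eta\mathbf{A}\mathbf{A}^T\mathbf{A} + \beta\mathbf{D}\mathbf{A})_{jm}} \qquad (25)$$

We introduce an iterative procedure to optimize (18) and present the details in Algorithm 1. The proposed optimization algorithm is convergent, which can be proved by following the strategy in [41]. In the experiments, we observed that the proposed optimization converges quickly. Fig. 1 shows the value of the objective function over the iterations. We perform our experiments in MATLAB on a PC with a 3.2 GHz CPU and 4 GB of memory.

Fig. 1. The convergence curve of the proposed optimization algorithm in Algorithm 1. (x-axis: number of iterations, 0 to 200; y-axis: objective function value.)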
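As a brief consistency check on the derivation (a sketch added for illustration, with $\mathbf{N}$ and $\mathbf{M}$ introduced here only as shorthand for the numerator and denominator matrices of rule (24), and assuming $M_{mi} > 0$), a fixed point of the multiplicative update for $\mathbf{U}$ satisfies the KKT condition (22):

```latex
\begin{aligned}
\mathbf{N} &= \mathbf{A}^T\mathbf{X}\hat{\mathbf{Y}}^T + \lambda\mathbf{U}\mathbf{L}_C^{-},
\qquad
\mathbf{M} = \mathbf{A}^T\mathbf{X}\mathbf{X}^T\mathbf{A}\mathbf{U} + \lambda\mathbf{U}\mathbf{L}_C^{+} + \alpha\mathbf{U}, \\
U_{mi} = U_{mi}\,\frac{N_{mi}}{M_{mi}}
&\;\Longleftrightarrow\; (M_{mi} - N_{mi})\,U_{mi} = 0 \\
&\;\Longleftrightarrow\; \bigl(\mathbf{A}^T\mathbf{X}\mathbf{X}^T\mathbf{A}\mathbf{U}
 + \lambda\mathbf{U}(\mathbf{L}_C^{+} - \mathbf{L}_C^{-}) + \alpha\mathbf{U}
 - \mathbf{A}^T\mathbf{X}\hat{\mathbf{Y}}^T\bigr)_{mi}\,U_{mi} = 0 .
\end{aligned}
```

Since $\mathbf{L}_C = \mathbf{L}_C^{+} - \mathbf{L}_C^{-}$, the last line is exactly Eq. (22). The same argument applied to rule (25) recovers Eq. (23), with the $\eta\mathbf{A}$ term in the numerator matching the $-\eta\mathbf{A}$ term of the gradient in Eq. (21).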

4. Experiments

To validate the effectiveness of our proposed approach on tag relevance learning, we conduct extensive experiments and apply our method to text-based social image retrieval and automatic image tagging.

4.1. Experimental setting

We conduct experiments on the real-world image dataset NUS-WIDE-Lite [47], which contains 55,615 images with 5018 unique tags. It provides ground-truth annotations of the images for 81 concepts. Note that the tags are rather noisy and many of them are misspelled or meaningless words. Hence, pre-processing was performed to filter out these tags. Moreover, to avoid the sample insufficiency issue in optimization, we remove those tags whose occurrence numbers are below 50. Consequently, 1805 unique tags were obtained in total. To represent the visual content of photos, we construct a 1000-dimensional bag-of-visual-words representation by extracting local features from images using the SIFT descriptor.

We randomly split the dataset into two subsets. One subset with 5000 images is adopted as the learning data to learn the proposed model and is also used to validate the performance of tag refinement. The remaining images are utilized as testing data to test the performance of image tagging. During the partition process, each tag is guaranteed to be associated with at least one image. To alleviate the instability introduced by the randomly selected learning data, we independently repeat the experiments 5 times to generate different learning and testing data, and report the average results. The results on the noisily tagged learning data and on the testing data are both reported.

We evaluate the performance on the 81 concepts in NUS-WIDE-Lite for which ground-truth annotations have been provided. The performance is evaluated using the F1 measure, which is defined as $F1 = \frac{2RP}{R + P}$, where $R = \frac{|N_c|}{|N_g|}$ and $P = \frac{|N_c|}{|N_t|}$. Here $|N_g|$ is the number of images tagged with a concept $w$ in the ground truth, $|N_t|$ is the number of images tagged with $w$ by our algorithm, and $|N_c|$ is the number of images correctly tagged with $w$ by our algorithm. The mean F1 over the 81 concepts is presented. Besides, we use Average Precision (AP) to evaluate the tag ranking performance. It corresponds to the average of the precision at each position where an accurate tag appears. Let $P(k)$ measure the percentage of accurate tags within the top $k$ positions of the ranking. We have

$$AP = \frac{1}{|R_G|}\sum_{k=1}^{m} P(k)\,\mathrm{rel}(k), \qquad (26)$$

where $\mathrm{rel}(k)$ is an indicator function equaling 1 if the tag at rank $k$ is tagged by the corresponding user, and zero otherwise. Mean Average Precision (MAP) is obtained by averaging the APs over the 81 concepts.

Several parameters in our algorithm need to be set in advance. The trade-off parameter $\alpha$ in Eq. (17) is set to 0.005 empirically. The parameter $\eta$ is set to $10^8$ to guarantee that the orthogonality constraint is satisfied. The parameters $\gamma$ and $\lambda$, which balance the information of local visual geometry and tag geometry, are set to 0.001 and 0.01 empirically. For the rank of latent factors $p$ and the regularization parameter $\beta$, we tune them by a "grid-search" strategy [48] and conduct a sensitivity analysis in Section 4.3.
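The two evaluation measures can be written compactly as below. This is a minimal sketch with hypothetical function names; it assumes that $|R_G|$ in Eq. (26) is the number of relevant items in the ranked list, which the text does not state explicitly.

```python
import numpy as np

def f1_score(predicted, ground_truth):
    """F1 for one concept; both arguments are boolean arrays over the images."""
    predicted = np.asarray(predicted, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)
    n_c = np.sum(predicted & ground_truth)     # |N_c|: correctly tagged images
    if n_c == 0:
        return 0.0
    precision = n_c / np.sum(predicted)        # P = |N_c| / |N_t|
    recall = n_c / np.sum(ground_truth)        # R = |N_c| / |N_g|
    return float(2 * recall * precision / (recall + precision))

def average_precision(relevant_in_rank_order):
    """Eq. (26): mean of P(k) over the ranks k that hold a relevant item."""
    rel = np.asarray(relevant_in_rank_order, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    # Assumption: |R_G| is the number of relevant items in the ranking.
    return float(np.sum(precision_at_k * rel) / rel.sum())
```

MAP is then the mean of `average_precision` over the 81 concepts, and the reported F1 is the mean of `f1_score` over the same concepts.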

Fig. 2. The MAPs of the proposed PTNMF with varying parameters. (a) The dimension of the latent factors $p$ (x-axis: $p$ = 50, 100, 200, 300, 500, 1000; y-axis: MAP (%)). (b) The parameter $\beta$ (x-axis: $\beta$ = 0, 1e−006, 0.0001, 0.01, 1, 100, 10000, 1e+006; y-axis: MAP (%)).

Table 1. The results of the proposed PTNMF method and the compared methods.

| Method | Image refinement F1 (%) | Image refinement MAP (%) | Image tagging F1 (%) | Image tagging MAP (%) |
|---|---|---|---|---|
| OT | 41.98 ± 1.18 | 36.80 ± 1.03 | – | – |
| CIAR [14] | 42.27 ± 0.94 | 37.20 ± 0.60 | 14.38 ± 0.80 | 6.42 ± 0.14 |
| PRW [15] | 43.97 ± 1.08 | 38.39 ± 0.51 | 13.64 ± 0.57 | 7.07 ± 0.07 |
| TRNV [2] | 43.43 ± 0.42 | 37.41 ± 1.05 | 13.72 ± 0.21 | 6.02 ± 0.15 |
| MPMF [9] | 44.30 ± 0.92 | 37.60 ± 1.24 | 13.84 ± 1.03 | 6.09 ± 0.22 |
| TWTV [17] | 44.87 ± 0.80 | 38.74 ± 0.71 | 13.97 ± 0.68 | 7.23 ± 0.14 |
| PTNMF | 45.32 ± 0.19 | 39.23 ± 0.74 | 15.30 ± 0.25 | 12.13 ± 1.03 |

4.2. Compared methods

To evaluate the performance of the proposed PTNMF method, we compared it extensively with the following methods:

- OT: the original tags associated with images.
- CIAR: the tag refinement algorithm of Content-based Image Annotation Refinement proposed in [14].
- PRW: the tag ranking method proposed in [15], which can be viewed as a combination of Probabilistic tag ranking and Random Walk-based tag ranking.
- TRNV: the Tag Relevance by Neighbor Voting learning algorithm [2].
- MPMF: the Multi-correlation Probabilistic Matrix Factorization method proposed in [9].
- TWTV: the Two-view Tag Weighting method, which combines the local information in the Tag space and the Visual space [17].

For the above methods, new images are tagged based on the KNN method. For a new image $\mathbf{x}_o$, we first find its nearest neighbors $N_{x_o}$ in the learning data based on the visual information and then estimate its relevance to tags as follows. The relevance score of $\mathbf{x}_o$ to the $j$th tag, $r_{oj}$, is calculated by

$$r_{oj} = \frac{\sum_{k \in N_{x_o}} S_{ko}\, r_{kj}}{\sum_{k \in N_{x_o}} S_{ko}},$$

where $S_{ko}$ is the visual similarity between the new image $\mathbf{x}_o$ and the image $\mathbf{x}_k$, and $r_{kj}$ is the learned relevance score of $\mathbf{x}_k$ to the $j$th tag. The new image is tagged based on the estimated relevance scores.
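The KNN-based tagging rule used for the baseline methods amounts to a similarity-weighted average of the neighbors' relevance scores. The sketch below uses hypothetical names (`knn_tag_relevance`, `S_new`, `R`) and assumes the similarities of the new image to all learning images are already available as a vector.

```python
import numpy as np

def knn_tag_relevance(S_new, R, k):
    """Relevance of every tag to a new image from its k most similar learning images.

    S_new: length-n vector of visual similarities S_ko to the learning images.
    R:     n x m matrix of learned relevance scores r_kj (images x tags).
    """
    S_new = np.asarray(S_new, dtype=float)
    neighbours = np.argsort(-S_new)[:k]          # indices of the k nearest neighbours
    weights = S_new[neighbours]
    # r_oj = sum_k S_ko * r_kj / sum_k S_ko over the neighbourhood
    return weights @ np.asarray(R, dtype=float)[neighbours] / weights.sum()
```

Unlike the projection of Eqs. (8) and (9), this rule needs the learning images and their similarities at tagging time, which is the practical cost of methods that do not learn an explicit mapping from visual features.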

4.3. Sensitivity analysis of parameters

Several free parameters in our method need to be analyzed. $p$ controls the dimension of the latent factors. $\beta$ determines the importance of the $\ell_{2,1}$-norm regularization term. The corresponding results are presented in Fig. 2. Fig. 2(a) illustrates the MAP of PTNMF on the learning data while varying the dimensionality of the latent factors. From the results, we observe that the MAP is improved to some extent by increasing the dimensionality of the unified space. Considering that a high dimension corresponds to an expensive computing cost, we set $p = 300$ to balance the performance and the cost. We also test the effect of varying the parameter $\beta$ on the performance and present the corresponding MAP on the learning data in Fig. 2(b). It is observed that the introduced $\ell_{2,1}$-norm regularization term is effective as long as $\beta$ is neither too small nor too large. When $\beta$ is set to a suitable value, the regularization can select effective features to transform the visual features into the image latent features. If $\beta$ is too small, it cannot filter out the noisy visual features, whereas if it is too large, it removes too many visual features.

4.4. Experimental analysis

To demonstrate the effectiveness of the proposed PTNMF method, we compare it with the methods listed in Section 4.2. For each method, we report its best results obtained by tuning its parameters. The F1 and MAP of these methods are reported in Table 1. We also present the detailed performance in terms of F1 and AP over the 81 concepts on the learning data in Figs. 3 and 4, respectively. From the experimental results, we can draw the following observations. First of all, the proposed PTNMF algorithm achieves the best results on both the learning data and the testing data, which demonstrates its performance. Second, all the tag relevance learning methods outperform OT significantly, which verifies the necessity of tag refinement. This also demonstrates that the relevant tags may not be placed at the top positions. Third, the best performance of the proposed method for tagging new social images validates its effectiveness for the out-of-sample problem.

Fig. 3. The detailed performance comparison in terms of F1 over 81 concepts on the learning data. (Three panels; methods shown: OT, CIAR, PRW, TRNV, TWTV, MPMF, PTNMF.)

Fig. 4. The detailed performance comparison in terms of AP over 81 concepts on the learning data. (Three panels; methods shown: OT, CIAR, PRW, TRNV, TWTV, MPMF, PTNMF.)

Our method is signiﬁcantly superior to the other approaches for processing out-of-samples. This validates the beneﬁts of our motivation towards out-of-samples. Finally, the retrieval performance of PTNMF outperforms others. It demonstrates that our method can rank the relevant tags at the top positions. The proposed method PTNMF is effective for social image retrieval.

5. Conclusion

In this paper, we propose a new nonnegative matrix factorization approach to estimate the tag relevance. To address the out-of-sample problem, we introduce a transformation matrix with $\ell_{2,1}$-norm regularization to map the visual features to the image latent features. The $\ell_{2,1}$-norm regularization makes the transformation matrix suitable for selecting the effective features. The image visual similarity and tag correlation are incorporated simultaneously to preserve the local visual geometry and local textual geometry. For an untagged image, we first map it into the image latent space using the learned transformation matrix, and then calculate its relevance to tags by multiplying its latent feature by the tag latent features. Empirical results on a real dataset have demonstrated the effectiveness of our method.

Acknowledgments

This work was supported by the 863 Program (2014AA015101), the National Natural Science Foundation of China (Grant nos. 61402228, 61103059 and 61272329), the Program for New Century Excellent Talents in University under Grant NCET-12-0632, the Natural Science Foundation of Jiangsu Province under Grant BK2012033, and the Open Projects Program of the National Laboratory of Pattern Recognition.

References

[1] S.A. Golder, B.A. Huberman, Usage patterns of collaborative tagging systems, J. Inf. Sci. 32 (2) (2006) 198–208.
[2] X. Li, C.G.M. Snoek, M. Worring, Learning social tag relevance by neighbor voting, IEEE Trans. Multimed. 11 (7) (2009) 1310–1322.
[3] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D.M. Blei, M.I. Jordan, Matching words and pictures, J. Mach. Learn. Res. 3 (2003) 1107–1135.
[4] J. Li, J.Z. Wang, Real-time computerized annotation of pictures, IEEE Trans. Pattern Anal. Mach. Intell. 30 (6) (2008) 985–1002.
[5] V. Lavrenko, R. Manmatha, J. Jeon, A model for learning the semantics of pictures, Adv. Neural Inf. Process. Syst. (2004) 553–560.
[6] F. Wang, C. Zhang, Label propagation through linear neighborhoods, in: Proceedings of IEEE International Conference on Machine Learning, 2006, pp. 55–67.
[7] J. Liu, M. Li, Q. Liu, H. Lu, S. Ma, Image annotation via graph learning, Pattern Recognit. 42 (2) (2009) 218–228.
[8] X. He, R. Zemel, Learning hybrid models for image annotation with partially labeled data, Adv. Neural Inf. Process. Syst. (2009) 625–632.
[9] Z. Li, J. Liu, X. Zhu, T. Liu, H. Lu, Image annotation using multi-correlation probabilistic matrix factorization, in: Proceedings of ACM International Conference on Multimedia, 2010, pp. 1187–1190.
[10] J. Tang, H. Li, G.-J. Qi, T.-S. Chua, Image annotation by graph-based inference with integrated multiple/single instance representation, IEEE Trans. Multimed. 12 (2) (2010) 131–141.
[11] R. Hong, M. Wang, Y. Gao, D. Tao, X. Li, X. Wu, Image annotation by multiple-instance learning with discriminative feature mapping and selection, IEEE Trans. Cybernet. 44 (5) (2014) 669–680.
[12] R. Salakhutdinov, A. Mnih, Probabilistic matrix factorization, Adv. Neural Inf. Process. Syst. (2007) 361–370.
[13] J. Weston, S. Bengio, N. Usunier, Wsabie: scaling up to large vocabulary image annotation, in: Proceedings of International Joint Conference on Artificial Intelligence, 2011, pp. 2764–2770.
[14] C. Wang, F. Jing, L. Zhang, H.-J. Zhang, Content-based image annotation refinement, in: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[15] D. Liu, X.-S. Hua, L. Yang, M. Wang, H.-J. Zhang, Tag ranking, in: Proceedings of ACM International Conference on World Wide Web, 2009, pp. 351–360.
[16] G. Zhu, S. Yan, Y. Ma, Image tag refinement towards low-rank, content-tag prior and error sparsity, in: Proceedings of ACM International Conference on Multimedia, 2010, pp. 461–470.
[17] J. Zhuang, S.C.H. Hoi, A two-view learning approach for image tag ranking, in: Proceedings of ACM Conference on Web Search and Data Mining, 2011, pp. 625–634.
[18] J. Sang, C. Xu, J. Liu, User-aware image tag refinement via ternary semantic analysis, IEEE Trans. Multimed. 14 (3-2) (2012) 883–895.
[19] Z. Li, J. Liu, H. Lu, Nonlinear matrix factorization with unified embedding for social tag relevance learning, Neurocomputing 105 (2013) 38–44.
[20] J. Wright, A. Ganesh, S. Rao, Y. Peng, Y. Ma, Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization, Adv. Neural Inf. Process. Syst. (2009) 2080–2088.
[21] S. Rendle, L.B. Marinho, A. Nanopoulos, L. Schmidt-Thieme, Learning optimal ranking with tensor factorization for tag recommendation, in: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 727–736.
[22] Z. Li, J. Liu, Y. Yang, X. Zhou, H. Lu, Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Trans. Knowl. Data Eng. 26 (9) (2014) 2138–2150.
[23] P. Duygulu, K. Barnard, N. de Freitas, D. Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, in: Proceedings of European Conference on Computer Vision, 2002, pp. 97–112.
[24] J. Jeon, V. Lavrenko, R. Manmatha, Automatic image annotation and retrieval using cross-media relevance models, in: Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 119–126.
[25] S. Feng, R. Manmatha, V. Lavrenko, Multiple Bernoulli relevance models for image and video annotation, in: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2004, pp. 1002–1009.
[26] J. Liu, B. Wang, M. Li, W. Ma, H. Lu, S. Ma, Dual cross-media relevance model for image annotation, in: Proceedings of ACM International Conference on Multimedia, 2007, pp. 605–614.
[27] X. Liu, R. Ji, H. Yao, P. Xu, X. Sun, T. Liu, Cross-media manifold learning for image retrieval & annotation, in: Proceedings of ACM International Conference on Multimedia Information Retrieval, 2008, pp. 141–148.
[28] A. Makadia, V. Pavlovic, S. Kumar, A new baseline for image annotation, in: Proceedings of European Conference on Computer Vision, 2008, pp. 316–329.
[29] C. Wang, S. Yan, L. Zhang, H.-J. Zhang, Multi-label sparse coding for automatic image annotation, in: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2009, pp. 1643–1650.
[30] M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid, Tagprop: discriminative metric learning in nearest neighbor models for image auto-annotation, in: Proceedings of IEEE International Conference on Computer Vision, 2009, pp. 309–316.
[31] C. Wang, F. Jing, L. Zhang, Image annotation refinement using random walk with restarts, in: Proceedings of ACM International Conference on Multimedia, 2006, pp. 647–650.
[32] L. Wu, L. Yang, N. Yu, X.-S. Hua, Learning to tag, in: Proceedings of ACM International Conference on World Wide Web, 2009, pp. 361–370.
[33] J. Tang, S. Yan, R. Hong, G.-J. Qi, T.-S. Chua, Inferring semantic concepts from community-contributed images and noisy tags, in: Proceedings of ACM International Conference on Multimedia, 2009, pp. 223–232.
[34] Z.-J. Zha, L. Yang, T. Mei, M. Wang, Z. Wang, Visual query suggestion, in: Proceedings of ACM Multimedia, 2009, pp. 15–24.
[35] X. Li, C.G.M. Snoek, M. Worring, Unsupervised multi-feature tag relevance learning for social image retrieval, in: Proceedings of ACM International Conference on Image and Video Retrieval, 2010, pp. 10–17.
[36] J. Tang, R. Hong, S. Yan, T.-S. Chua, G.-J. Qi, R. Jain, Image annotation by kNN-sparse graph-based label propagation over noisily-tagged web images, ACM Trans. Intell. Syst. Technol. 2 (2) (2011) 14.
[37] Z.-J. Zha, L. Yang, T. Mei, M. Wang, Z. Wang, T.-S. Chua, X.-S. Hua, Visual query suggestion: towards capturing user intent in Internet image search, ACM Trans. Multimed. Comput. Commun. Appl. 6 (3) (2010) 13.
[38] J. Tang, Z.-J. Zha, D. Tao, T.-S. Chua, Semantic-gap oriented active learning for multi-label image annotation, IEEE Trans. Image Process. 21 (4) (2012) 2354–2360.
[39] Z. Li, J. Liu, J. Tang, H. Lu, Projective matrix factorization with unified embedding for social image tagging, Comput. Vis. Image Underst. 124 (2014) 71–78.
[40] S.K. Gupta, D. Phung, B. Adams, T. Tran, S. Venkatesh, Nonnegative shared subspace learning and its application to social media retrieval, in: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 1169–1178.
[41] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, Adv. Neural Inf. Process. Syst. 13 (2001) 556–562.
[42] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, Adv. Neural Inf. Process. Syst. (2001) 585–591.
[43] X. He, P. Niyogi, Locality preserving projections, Adv. Neural Inf. Process. Syst. (2003) 153–160.
[44] Z. Li, J. Liu, H. Lu, Sparse constraint nearest neighbour selection in cross-media retrieval, in: Proceedings of IEEE International Conference on Image Processing, 2010, pp. 1465–1468.
[45] D. Liu, S. Yan, X.-S. Hua, H.-J. Zhang, Image retagging using collaborative tag propagation, IEEE Trans. Multimed. 13 (4) (2011) 702–712.

Please cite this article as: Q. Liu, Z. Li, Projective nonnegative matrix factorization for social image retrieval, Neurocomputing (2015), http://dx.doi.org/10.1016/j.neucom.2014.09.094


[46] H. Kuhn, A. Tucker, Nonlinear programming, in: Proceedings of Second Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Oakland, CA, 1951.
[47] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y.-T. Zheng, NUS-WIDE: a real-world web image database from National University of Singapore, in: Proceedings of ACM International Conference on Image and Video Retrieval, 2009, pp. 48:1–9.
[48] C.-W. Hsu, C.-C. Chang, C.-J. Lin, A Practical Guide to Support Vector Classification, 〈http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf〉.

Zechao Li received the B.E. degree from the University of Science and Technology of China (USTC), Anhui, China, in 2008, and the Ph.D. degree from National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences in 2013. He is an assistant professor with the School of Computer Science, Nanjing University of Science and Technology. His current research interests include matrix factorization, correlation mining, social multimedia analysis and applications.

Qiuli Liu received the M.E. degree from Nanjing Normal University. Currently, she is with the Capital University of Economics and Business. Her current research interests include art design.
