Semantic consistency hashing for cross-modal retrieval


Neurocomputing 193 (2016) 250–259


Tao Yao a,b, Xiangwei Kong a,*, Haiyan Fu a, Qi Tian c

a School of Information and Communication Engineering, Dalian University of Technology, Dalian 116023, China
b Department of Information and Electrical Engineering, Ludong University, Yantai 264025, China
c Department of Computer Science, University of Texas at San Antonio, San Antonio 78249, USA
* Corresponding author.

Article history: Received 19 October 2015; received in revised form 23 December 2015; accepted 8 February 2016; available online 20 February 2016. Communicated by Tao Mei.

Abstract

The task of cross-modal retrieval is to query similar objects across modalities, for example using text to retrieve images and vice versa. However, most existing methods suffer from high computational complexity and storage cost in large-scale applications. Recently, hashing methods, which map high-dimensional data to compact binary codes, have attracted much attention owing to their efficiency and low storage cost on large-scale datasets. In this paper, we propose a Semantic Consistency Hashing (SCH) method for cross-modal retrieval. SCH learns a shared semantic space while simultaneously taking both inter-modal and intra-modal semantic correlations into account. To preserve inter-modal semantic consistency, an identical representation is learned with non-negative matrix factorization for the samples of different modalities. Meanwhile, a neighbor-preserving algorithm is adopted to preserve the semantic consistency within each modality. In addition, an efficient optimization algorithm is proposed that reduces the training time complexity from the traditional O(N²) or higher to O(N). Extensive experiments on two public datasets demonstrate that the proposed approach significantly outperforms existing schemes. © 2016 Elsevier B.V. All rights reserved.

Keywords: Cross-modal retrieval; Semantic consistency; Hashing; Non-negative matrix factorization; Neighbor preserving

1. Introduction

With the rapid development of information technology and the Internet, a single webpage may contain text, audio, images, video and so on. Although these data are represented in different modalities, they have strong semantic correlations. For example, Fig. 1 displays a number of documents collected from Wikipedia. Each document includes one image along with its surrounding text. Image–text pairs connected by a blue solid line have strong semantic correlation; pairs connected by a blue dotted line are relevant to each other, i.e. they share the same semantic concept; and pairs connected by a red dotted line are irrelevant to each other. The task of cross-modal retrieval is to use one kind of media to retrieve similar samples from a dataset of a different modality, with the returned samples ranked by their correlation with the query. However, with the explosive growth of multimedia on the Internet, storage cost and efficiency are two main challenges in large-scale retrieval. Hashing methods, which map samples from a high-dimensional feature space to a low-dimensional binary Hamming space, have received much attention due to their efficiency and low memory cost [1–13]. However, most existing hashing schemes work only


on a single modality [1–6], and only a few works have addressed multi-modal retrieval so far [7,9–12]. Multi-modal hashing can generally be categorized into two types: multi-modal fusion hashing (MMFH) and cross-modal hashing (CMH). MMFH aims at generating better binary codes than single-modal hashing by exploiting the complementarity of the modalities [7], whereas CMH constructs a shared Hamming space in which similar samples can be retrieved over a heterogeneous cross-modal dataset [9–12]. In this paper, we focus on CMH. The key point of CMH is to find the correlation between different modalities in the Hamming space; however, learning a low-dimensional Hamming space over a heterogeneous cross-modal dataset remains a challenging issue. Many recent works focus on this problem. For example, Canonical Correlation Analysis (CCA) hashing maps samples from different modalities to a low-dimensional Hamming space by maximizing the correlation between the modalities [14]. Multimodal latent binary embedding (MLBE) employs binary latent factors in a probabilistic model to learn hashing codes [15]. Co-Regularized Hashing (CRH) learns a low-dimensional Hamming space by mapping the data far from zero for each bit, while effectively preserving inter-modal similarity [8]. Multimodal NN hashing (MMNNH) [16] learns a group of hashing functions by preserving intra-modal and inter-modal similarity. However, the above cross-modal hashing methods learn hashing functions directly and separately for each modality, which may degrade retrieval performance because the learned Hamming space is not semantically discriminative.



Fig. 1. A number of examples are collected from Wikipedia. The correlations between images and texts are denoted by different lines and colors. The blue solid line represents that the image and texts have strong semantic correlation with each other. The blue dashed line represents that they are relevant to each other. The red dotted line represents that they are irrelevant to each other. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

To address this issue, a supervised scheme (SliM²) [17] embeds heterogeneous data into a semantic space by dictionary learning and sparse coding. In [18], the Latent Semantic Sparse Hashing (LSSH) algorithm learns a semantic space by sparse coding and matrix factorization: sparse coding captures the salient structures of images, and matrix factorization learns the latent concepts of text; a linear mapping matrix is then learned to bridge the semantic spaces of text and image. Collective Matrix Factorization Hashing (CMFH) [19] projects samples into a common semantic space by collective matrix factorization, so that inter-modal semantic similarity is preserved effectively. The results of these methods show that learning a semantic space is helpful for cross-modal retrieval. However, they only consider preserving inter-modal semantic consistency and ignore intra-modal semantic consistency. Inter-modal semantic consistency aims at preserving the global similarity structure, while intra-modal semantic consistency aims at preserving the local similarity structure of each modality in the learned low-dimensional semantic space. Moreover, recent studies have shown that real-world samples in a high-dimensional space often lie on a low-dimensional manifold [20,21]. Hence it is beneficial to introduce intra-modal semantic consistency into the cross-modal retrieval framework.

In this paper, we put forward a semantic consistency hashing method for cross-modal retrieval. We aim to efficiently learn binary codes for different modalities by jointly integrating intra-modal and inter-modal semantic consistency into one framework. To preserve inter-modal semantic consistency, an identical representation is learned by non-negative matrix factorization (NMF) for the samples of different modalities. The main advantages of NMF are: (1) non-negative representations are consistent with the cognition of the human brain [22,23]; (2) the non-negativity constraint induces sparsity, and relatively sparse representations can resist noise to a certain extent [24], which is beneficial for learning the shared semantic space with noisy labels. To preserve intra-modal semantic consistency, a neighbor-preserving algorithm is utilized to keep the local similarity structure. This allows richer information in the data to be exploited for learning a better shared semantic space. Our main contributions are as follows:

1. We propose a semantic consistency hashing method to effectively find the semantic correlation between different modalities in the shared semantic space. Not only is the inter-modal semantic consistency preserved by NMF, but the intra-modal semantic consistency is also preserved by a neighbor-preserving algorithm in the shared semantic space.

2. We propose an efficient iterative optimization framework. In experiments, we find that satisfactory performance can be achieved in about 10–20 iterations. Meanwhile, the training


time complexity is reduced from the traditional O(N²) or higher (as in, for example, [8,15,19]) to O(N).

3. As for performance, SCH significantly outperforms existing approaches. Extensive experiments on two public datasets show that SCH outperforms the baseline algorithms by a maximum of 16.48% in mean average precision.

The rest of this paper is organized as follows. In Section 2, we briefly introduce non-negative matrix factorization. In Section 3, we present the components of the proposed method, including the formulation of SCH, the optimization, the generation of hashing codes, and a complexity analysis. In Section 4, we report experimental results on two public datasets and give experimental analysis. Finally, Section 5 provides concluding remarks.

2. Non-negative matrix factorization

NMF has been widely applied in many fields [25–29] and has attracted much attention owing to its theoretical interpretability and excellent performance. The algorithm of NMF is as follows. Given a matrix $M = \{m_1, m_2, \ldots, m_N\} \in \mathbb{R}_+^{d \times N}$, where each $m_i$ is a d-dimensional vector denoting one sample, NMF aims at finding two matrices $U \in \mathbb{R}_+^{d \times P}$ and $V \in \mathbb{R}_+^{P \times N}$ (where P is the dimension of the latent space) satisfying $M \approx UV$. U and V can be obtained by minimizing the following objective function:

$$L = \arg\min_{U,V} \lVert M - UV \rVert_F^2 \quad \text{s.t. } U \ge 0,\; V \ge 0 \tag{1}$$

where $\lVert\cdot\rVert_F$ denotes the Frobenius norm. The above problem is non-convex in U and V jointly, but it is convex in either variable when the other is fixed. We can use the iterative updating scheme proposed in [30] to solve Eq. (1):

$$U_{i,j} = U_{i,j} \frac{(M V^T)_{i,j}}{(U V V^T)_{i,j}} \tag{2}$$

$$V_{i,j} = V_{i,j} \frac{(U^T M)_{i,j}}{(U^T U V)_{i,j}} \tag{3}$$

It has been proved that the two iterative updating steps in Eqs. (2) and (3) lead efficiently to a local minimum of Eq. (1). In this paper, NMF is utilized to learn the shared semantic space.
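As a concrete illustration of this update scheme, the following is a minimal NumPy sketch of the multiplicative rules in Eqs. (2) and (3). The random initialization, the fixed iteration count and the small constant guarding the denominators are illustrative assumptions rather than details given in the paper.

```python
import numpy as np

def nmf(M, P, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix M (d x N) as M ~= U V, with U (d x P), V (P x N)."""
    rng = np.random.default_rng(seed)
    d, N = M.shape
    U = rng.random((d, P))                    # non-negative random initialization (assumed)
    V = rng.random((P, N))
    for _ in range(n_iter):
        U *= (M @ V.T) / (U @ V @ V.T + eps)  # Eq. (2)
        V *= (U.T @ M) / (U.T @ U @ V + eps)  # Eq. (3)
    return U, V
```

In practice the loop would be stopped once the decrease of the objective in Eq. (1) falls below a threshold rather than after a fixed number of iterations.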

3. Semantic consistency hashing

In this section, we describe the SCH algorithm in detail. First, we present the main formulation of SCH. Then a three-step iterative algorithm is proposed to optimize the problem. Finally, the generation of hashing codes and a complexity analysis are presented.

3.1. Formulation

Suppose that $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(m)}\}$, where $X^{(i)}$ denotes the i-th modality of the data and m is the number of modalities. $X^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_N^{(i)}\}$ with $x_k^{(i)} \in \mathbb{R}^{d_i}$ (generally $d_i \ne d_j$ if $i \ne j$), where $d_i$ is the feature dimension and N is the number of samples. $\{x_k^{(1)}, x_k^{(2)}, \ldots, x_k^{(m)}\}$ is the representation of the k-th sample in all modalities. The purpose of hashing is to learn a set of hashing functions $H = \{h_j: \mathbb{R}^{d_i} \to Y_j\}_{j=1}^{p}$ mapping samples to binary codes $Y \in \{-1, 1\}^{p \times N}$, s.t. $\sum_{i,j} Y_{i,j} = 0$, where p is the length of the hashing codes and each column of Y denotes the hashing code of one sample.

The goal of SCH is to learn hashing codes that effectively bridge the semantic gap between different modalities. We expect that samples that are similar in the high-dimensional feature space remain similar in the learned low-dimensional semantic space. The proposed SCH method utilizes NMF to learn the shared semantic space and adds a similarity-preserving constraint to the NMF model. NMF has the intrinsic property of effectively finding a low-dimensional subspace for high-dimensional features [25,27,23]. Here, NMF is extended to the multi-modal setting to effectively preserve the semantic correlation between different modalities, and the two factor matrices of each modality are learned jointly. In heterogeneous datasets one sample is represented in several modalities, and these representations obviously share the same semantic concept. Accordingly, we use an identical vector to represent these samples in the shared semantic space. The objective function for preserving the inter-modal semantic consistency is defined as follows:

$$O_{inter} = \sum_{i=1}^{m} \lambda_i \lVert X^{(i)} - U^{(i)} V \rVert_F^2 \quad \text{s.t. } U^{(i)} \ge 0,\; V \ge 0 \tag{4}$$

where $U^{(i)} \in \mathbb{R}^{d_i \times p}$, $V \in \mathbb{R}^{p \times N}$, V is the shared semantic space, p is the dimension of the shared semantic space, and $\lambda_i$ is a positive trade-off parameter controlling the weight of each modality. The inter-modal semantic consistency is thus preserved in the shared semantic space.

Most recent hashing-based works preserve the global similarity structure to implement cross-modal retrieval [17–19]. In fact, the local similarity structure is often more important than the global one [21], and some recent works have achieved excellent performance by exploiting it [8,23]. Inspired by this, we preserve the local similarity structure by constructing a K-nearest-neighbor graph. The intra-modal loss function is defined as follows:

$$O_{intra} = \frac{1}{2} \sum_{i=1}^{m} \mu_i \sum_{j,k} w^{i}_{j,k} \lVert v_j - v_k \rVert_F^2 \tag{5}$$

where $\mu_i$ is a parameter controlling the weight of the i-th modality, and $w^{i}_{j,k}$ represents the similarity between $x_j^{(i)}$ and $x_k^{(i)}$. Following the idea proposed in [31], a heat kernel is used to construct the weight matrix $W^i$ with elements $w^{i}_{j,k}$:

$$w^{i}_{j,k} = \begin{cases} e^{-\lVert x_j^{(i)} - x_k^{(i)} \rVert_F^2 / \sigma^2} & \text{if } x_k^{(i)} \in \mathrm{KNN}(x_j^{(i)}) \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where $\mathrm{KNN}(x_j^{(i)})$ denotes the K nearest neighbors of $x_j^{(i)}$ and $\sigma$ is a tuning parameter; generally $\sigma = 1$ and $w^{i}_{jj} = 0$. However, the computational cost of finding the K nearest neighbors of every training sample is too high over a large-scale dataset. In this paper, we therefore use a simple function to approximate the weight matrix:

$$w^{i}_{j,k} = \begin{cases} 1 & \text{if } \mathrm{label}(x_j^{(i)}) = \mathrm{label}(x_k^{(i)}) \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

where $\mathrm{label}(x_j^{(i)})$ indicates the label of $x_j^{(i)}$. We randomly select K samples with the same label for each training sample to construct the KNN graph. In experiments, we find that the performance obtained with Eq. (7) is almost equal to that obtained with Eq. (6) when K is relatively small. To handle large-scale applications, we then need only a small number of labelled samples in the training dataset to construct the weight matrix $W^i$, which reduces the labelling cost and the time complexity.
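As an illustration of the label-based approximation in Eq. (7), the sketch below builds a sparse weight matrix by sampling K same-label neighbors per training point. The use of SciPy sparse matrices and the specific sampling routine are assumptions added for exposition; the paper only states that K same-label samples are chosen at random.

```python
import numpy as np
from scipy.sparse import lil_matrix

def build_label_knn_graph(labels, K, seed=0):
    """Sparse N x N weight matrix with w[j, k] = 1 for K randomly chosen
    same-label neighbors of each training sample j, as in Eq. (7).
    `labels` is assumed to be a 1-D NumPy array of class indices."""
    rng = np.random.default_rng(seed)
    N = labels.shape[0]
    W = lil_matrix((N, N))
    for j in range(N):
        same = np.flatnonzero(labels == labels[j])
        same = same[same != j]                    # exclude the sample itself (w_jj = 0)
        if same.size:
            picks = rng.choice(same, size=min(K, same.size), replace=False)
            W[j, picks] = 1.0
    W = W.tocsr()
    return W.maximum(W.T)                         # symmetrize: W^i is a symmetric matrix
```

The diagonal degree matrix D^i and the Laplacian L^i = D^i − W^i used in Eq. (8) below follow directly from the row sums of this matrix.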

Given the matrix $W^i$, the intra-modal loss in each modality can be rewritten as:

$$\frac{1}{2}\sum_{j,k} w^{i}_{j,k} \lVert v_j - v_k \rVert_F^2 = \sum_{j} v_j D^{i}_{jj} v_j^{T} - \sum_{j,k} v_j w^{i}_{j,k} v_k^{T} = \operatorname{tr}(V D^{i} V^{T}) - \operatorname{tr}(V W^{i} V^{T}) = \operatorname{tr}(V L^{i} V^{T}) \tag{8}$$

where $L^i = D^i - W^i$ and $D^i$ is a diagonal matrix whose diagonal elements are the column (or row, since $W^i$ is symmetric) sums of $W^i$, that is, $D^{i}_{jj} = \sum_k w^{i}_{j,k}$. Obviously $L^i$ is a Laplacian matrix. The intra-modal loss function can therefore be written as:

$$O_{intra} = \sum_{i=1}^{m} \mu_i \frac{1}{2} \sum_{j,k} w^{i}_{j,k} \lVert v_j - v_k \rVert_F^2 = \sum_{i=1}^{m} \mu_i \operatorname{tr}(V L^{i} V^{T}) \tag{9}$$

where $\mu_i$ denotes the weight of the i-th modality. The binary codes of the training samples can then be obtained directly from V, but this cannot be generalized to queries directly. SCH therefore learns a modality-specific linear mapping matrix for generating the hashing code of each query; the objective function for this phase is:

$$O_{lmm} = \sum_{i=1}^{m} \beta_i \lVert V - T^{(i)} X^{(i)} \rVert_F^2 + \eta \sum_{i=1}^{m} \lVert T^{(i)} \rVert_F^2 \tag{10}$$

where $T^{(i)}$ is the learned linear mapping matrix, and $\beta_i$ and $\eta$ are weight parameters: $\beta_i$ controls the weight of learning the mapping matrix of each modality, and $\eta$ controls the weight of the regularization term. The overall objective function can then be written as:

$$O = \arg\min_{U^{(i)}, V, T^{(i)}} \left( O_{inter} + O_{intra} + O_{lmm} \right) = \arg\min_{U^{(i)}, V, T^{(i)}} \left( \sum_{i=1}^{m} \lambda_i \lVert X^{(i)} - U^{(i)} V \rVert_F^2 + \sum_{i=1}^{m} \mu_i \operatorname{tr}(V L^{i} V^{T}) + \sum_{i=1}^{m} \beta_i \lVert V - T^{(i)} X^{(i)} \rVert_F^2 + \eta \sum_{i=1}^{m} \lVert T^{(i)} \rVert_F^2 \right) \tag{11}$$

s.t. $U^{(i)} \ge 0$, $V \ge 0$. Both inter-modal and intra-modal semantic consistency are thus preserved in Eq. (11). Accordingly, richer information is exploited for learning the shared semantic space, and the semantic gap between different modalities can be effectively bridged. However, adding the similarity-preservation constraint to the model increases the computational cost; an efficient algorithm for solving Eq. (11) is proposed in Section 3.2.
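The overall objective in Eq. (11) can be evaluated directly from the current factor matrices, which is useful for monitoring convergence during the alternating optimization described next. The following is a minimal sketch assuming dense NumPy arrays and one entry per modality in each parameter list; all names are illustrative and not taken from the paper.

```python
import numpy as np

def sch_objective(Xs, Us, Ts, V, Ls, lam, mu, beta, eta):
    """Evaluate Eq. (11): O_inter + O_intra + O_lmm, summed over the m modalities.

    Xs, Us, Ts, Ls, lam, mu, beta are lists with one entry per modality;
    V is the shared semantic representation (p x N)."""
    O = 0.0
    for X, U, T, L, l, m_, b in zip(Xs, Us, Ts, Ls, lam, mu, beta):
        O += l * np.linalg.norm(X - U @ V, "fro") ** 2    # inter-modal term, Eq. (4)
        O += m_ * np.trace(V @ L @ V.T)                   # intra-modal term, Eq. (9)
        O += b * np.linalg.norm(V - T @ X, "fro") ** 2    # linear-mapping term, Eq. (10)
        O += eta * np.linalg.norm(T, "fro") ** 2          # regularization on T^(i)
    return O
```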

3.2. Optimization algorithm

In this section, we propose an iterative method to solve Eq. (11). The problem is not convex in $U^{(i)}$, V and $T^{(i)}$ jointly, so it is unrealistic to seek the global minimum of Eq. (11); instead, we design an algorithm that reaches a local minimum. The solution proceeds as follows.

1. Fixing $U^{(i)}$ and V, $T^{(i)}$ is obtained from a regularized least-squares problem with the closed-form solution:

$$T^{(i)} = V X^{(i)T} \left( X^{(i)} X^{(i)T} + \frac{\eta}{\beta_i} I \right)^{-1} \tag{12}$$

2. With $T^{(i)}$ fixed, since $U^{(i)} \ge 0$ and $V \ge 0$, Eq. (11) can be solved with the Lagrangian multiplier method. The Lagrangian function is defined as:

$$L(U^{(i)}, V) = \sum_{i=1}^{m} \left[ \lambda_i \lVert X^{(i)} - U^{(i)} V \rVert_F^2 + \mu_i \operatorname{tr}(V L^{i} V^{T}) + \beta_i \lVert V - T^{(i)} X^{(i)} \rVert_F^2 + \eta \lVert T^{(i)} \rVert_F^2 - \operatorname{tr}(\gamma^{(i)} U^{(i)T}) \right] - \operatorname{tr}(\alpha V^{T}) \tag{13}$$

where $\gamma^{(i)} \in \mathbb{R}^{d_i \times p}$ and $\alpha \in \mathbb{R}^{p \times N}$ are the Lagrangian multiplier matrices. Setting the partial derivatives of L with respect to $U^{(i)}$ and V to zero gives:

$$\frac{\partial L}{\partial U^{(i)}} = \lambda_i (-X^{(i)} V^{T} + U^{(i)} V V^{T}) - \gamma^{(i)} = 0 \tag{14}$$

$$\frac{\partial L}{\partial V} = \lambda_i (-U^{(i)T} X^{(i)} + U^{(i)T} U^{(i)} V) + \mu_i V L^{i} + \beta_i (V - T^{(i)} X^{(i)}) - \alpha = 0 \tag{15}$$

According to the Karush–Kuhn–Tucker (KKT) conditions [32], $\gamma^{(i)}_{jk} U^{(i)}_{jk} = 0$ and $\alpha_{jk} V_{jk} = 0$, so we obtain:

$$\lambda_i \left( -X^{(i)} V^{T} + U^{(i)} V V^{T} \right)_{jk} U^{(i)}_{jk} = 0 \tag{16}$$

$$\left[ \lambda_i (-U^{(i)T} X^{(i)} + U^{(i)T} U^{(i)} V) + \mu_i V L^{i} + \beta_i (V - T^{(i)} X^{(i)}) \right]_{jk} V_{jk} = 0 \tag{17}$$

This leads to the following updating rules:

$$U^{(i)}_{j,k} = U^{(i)}_{j,k} \frac{(X^{(i)} V^{T})_{j,k}}{(U^{(i)} V V^{T})_{j,k}} \tag{18}$$

$$V_{j,k} = V_{j,k} \frac{(\lambda_i U^{(i)T} X^{(i)} + \mu_i V W^{i} + \beta_i T^{(i)} X^{(i)})_{j,k}}{(\lambda_i U^{(i)T} U^{(i)} V + \mu_i V D^{i} + \beta_i V)_{j,k}} \tag{19}$$

Since the updating rules for $T^{(i)}$, $U^{(i)}$ and V are non-increasing and the objective function clearly has a lower bound, the procedure converges. Furthermore, in experiments, we find that the objective function generally converges within a few iterations.
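To make the alternating scheme concrete, the following sketch performs a single pass of the updates in Eqs. (12), (18) and (19) for one modality. The handling of a single modality only, the initialization outside the function, and the small epsilon guarding the denominators are assumptions added for exposition.

```python
import numpy as np

def sch_update_step(X, U, V, T, W, D, lam, mu, beta, eta, eps=1e-9):
    """One alternating update for a single modality (Eqs. (12), (18), (19)).

    X: d x N data, U: d x p basis, V: p x N shared representation,
    T: p x d linear mapping, W / D: N x N weight and degree matrices."""
    d = X.shape[0]
    # Eq. (12): closed-form update of the linear mapping T
    T = V @ X.T @ np.linalg.inv(X @ X.T + (eta / beta) * np.eye(d))
    # Eq. (18): multiplicative update of U
    U *= (X @ V.T) / (U @ V @ V.T + eps)
    # Eq. (19): multiplicative update of the shared semantic space V
    num = lam * (U.T @ X) + mu * (V @ W) + beta * (T @ X)
    den = lam * (U.T @ U @ V) + mu * (V @ D) + beta * V + eps
    V *= num / den
    return U, V, T
```

In the full algorithm this step is repeated until convergence, and V is shared across the per-modality terms; the sketch updates it with a single modality only for brevity.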

3.3. Generating hashing codes

The shared semantic space V is obtained by the above algorithm. The next task is to generate the hashing code of each sample. For training samples, hashing codes are generated by a simple thresholding strategy:

$$Y = \operatorname{sgn}(V - \bar{V}) \tag{20}$$

where sgn(·) denotes the element-wise sign function and $\bar{V}$ denotes the mean of V. For test samples, hashing codes are obtained from the learned mappings:

$$Y^{(i)}_{j} = \operatorname{sgn}(T^{(i)} x^{(i)}_{j} - b^{(i)}) \tag{21}$$

where $b^{(i)}$ is a vector denoting the mean of the mapped data. The overall procedure is summarized in Algorithm 1.

Algorithm 1. Semantic consistency hashing.
Input: N training samples $X = \{X_1, X_2, \ldots, X_N\}$ with m modalities, $X_i = \{X^{(1)}_i, X^{(2)}_i, \ldots, X^{(m)}_i\}$, and their labels $l = \{l_1, l_2, \ldots, l_N\}$ from q classes.
1: Compute the matrices $W^i$, $D^i$, $L^i$ for each modality, and initialize the variables $T^{(i)}$, $U^{(i)}$, V, threshold, $O_{old}$, O.
2: while $O - O_{old} >$ threshold do
3:   $O_{old} = O$,
4:   Update $T^{(i)}$ with the other variables fixed by Eq. (12),
5:   Update $U^{(i)}$ with the other variables fixed by Eq. (18),
6:   Update V with the other variables fixed by Eq. (19),
7:   Compute O by Eq. (11).
8: end while
9: Use Eq. (20) to obtain the hashing codes of the training samples.
10: Use Eq. (21) to obtain the hashing codes of the test samples.
11: Rank the retrieved samples for each query by Hamming distance and compute the mAP.
Output: The mAP performance.
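The thresholding in Eqs. (20) and (21) can be implemented in a few lines; the sketch below assumes NumPy arrays and maps the sign function's output to ±1, with zeros mapped to +1 (a detail the paper does not specify).

```python
import numpy as np

def train_codes(V):
    """Eq. (20): binary codes of training samples by mean-thresholding V (p x N)."""
    V_bar = V.mean(axis=1, keepdims=True)
    return np.where(V - V_bar >= 0, 1, -1)

def query_codes(X_query, T, b):
    """Eq. (21): binary codes of queries of one modality from the learned mapping
    T (p x d_i) and the mean vector b (p x 1) of the mapped training data."""
    return np.where(T @ X_query - b >= 0, 1, -1)
```

At query time, the codes produced by query_codes are compared with the stored training codes by Hamming distance, and the retrieved samples are ranked accordingly (step 11 of Algorithm 1).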


Table 1. Performance comparison on the Wiki dataset in terms of mAP for the task of text querying images (task: text to image).

Methods         p = 16    p = 32    p = 64    p = 128
CCA [14]        0.4025    0.3874    0.3762    0.3548
CRH [8]         0.4367    0.4601    0.4532    0.4521
SCM [33]        0.4715    0.4876    0.4753    0.4932
LSSH [18]       0.5346    0.5672    0.5723    0.5684
CMFH [19]       0.5924    0.6247    0.6373    0.6405
STMH [10]       0.6086    0.6296    0.6429    0.6537
SCH (K = 30)    0.6486    0.6654    0.6893    0.6903
SCH (K = 80)    0.6560    0.6752    0.6879    0.6916

Table 2. Performance comparison on the Wiki dataset in terms of mAP for the task of image querying texts (task: image to text).

Methods         p = 16    p = 32    p = 64    p = 128
CCA [14]        0.1458    0.1372    0.1201    0.1240
CRH [8]         0.1929    0.1756    0.1493    0.1530
SCM [33]        0.2343    0.2378    0.2401    0.2439
LSSH [18]       0.2217    0.2314    0.2403    0.2358
CMFH [19]       0.2472    0.2547    0.2583    0.2610
STMH [10]       0.2343    0.2401    0.2541    0.2587
SCH (K = 30)    0.2519    0.2673    0.2734    0.2751
SCH (K = 80)    0.2557    0.2719    0.2786    0.2812

3.4. Complexity analysis

We first analyze the training time complexity. The complexity of our method is mainly determined by the construction of the KNN graph and by the updating scheme. Building the KNN graph with Eq. (7) costs O(NK) per modality (where N is the size of the training set and K is the number of neighbors used per training sample), which makes the algorithm scalable. The complexity of the updating scheme depends on the dimensionality d_i, the length of the hashing codes p, the number of neighbors K, and the size of the training set N. It is important to observe that W is sparse, with K non-zero elements per column; thus computing VW costs only O(KpN). Since D is diagonal, computing VD costs O(pN). Hence the computational costs of updating T^(i), U^(i) and V are O(p·d_i·N + d_i²·N + d_i³), O(d_i·p·N + p²·N + d_i·p² + d_i·p) and O(2·d_i·p·N + p²·N + p²·d_i + p·N + K·p·N), respectively. Typically, d_i and p are much smaller than N, so the total training time complexity is reduced from the traditional O(N²) or higher to O(N): the overall training time is linear in the size of the training set. During the online procedure, with the precomputed mapping matrices T^(i), encoding a query into a hashing code costs O(p·d_i). Hence the query time complexity of our method is also low.
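The O(KpN) term for the product VW comes from the sparsity of W (K non-zeros per column). The snippet below, an illustration rather than the authors' implementation, shows how the sparse product and the diagonal scaling by D can be computed; the graph construction is a random stand-in for the label-based graph of Eq. (7).

```python
import numpy as np
from scipy.sparse import csr_matrix

p, N, K = 32, 20000, 30
V = np.random.rand(p, N)

# A weight matrix with K non-zero entries per column (random stand-in for Eq. (7)).
rows = np.random.randint(0, N, size=N * K)
cols = np.repeat(np.arange(N), K)
W = csr_matrix((np.ones(N * K), (rows, cols)), shape=(N, N))

VW = (W.T @ V.T).T                      # sparse-dense product: O(KpN) instead of dense O(pN^2)
D = np.asarray(W.sum(axis=0)).ravel()   # column sums = diagonal of D
VD = V * D                              # V @ diag(D) is column-wise scaling: O(pN)
```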

4. Experiments

In this section, we evaluate the performance of the proposed algorithm on two public real-world datasets. First, the datasets and evaluation criteria used in the experiments are introduced, and then the parameter settings and tuning are presented. Finally, we conduct experimental comparisons with several existing approaches, including CCA [14], CRH [8], SCM [33], LSSH [18], CMFH [19] and STMH [10], and report the results. Two cross-modal retrieval tasks are conducted: (1) retrieving the text dataset with an image query, and (2) retrieving the image dataset with a text query.

4.1. Datasets

Two public real-world datasets are widely used in cross-modal retrieval: the Wiki dataset [34] and the NUS-WIDE dataset [35]. The Wiki text–image dataset [34] is a notable public cross-modal dataset collected from Wikipedia. It contains 2866 image–text pairs of documents and has 10 different


semantic classes. The dataset is randomly divided into a training set of 2173 documents and a test set of 693 documents. Each document consists of one image together with more than 70 words. For the image modality, every image is represented by a 128-dimensional bag-of-words vector. For the text modality, the text of each document is represented by a 10-dimensional feature vector generated by latent Dirichlet allocation (LDA) [36].

Since the Wiki dataset is relatively small, the NUS-WIDE dataset is also tested to further evaluate the performance and scalability of the proposed cross-modal retrieval scheme. NUS-WIDE comprises 269,648 images collected from Flickr, and each image is associated with 6 tags on average [35]. Each image and its associated tags are taken as an image–text pair. The labelled images are manually annotated with 81 classes, and some images are multi-labelled. Because the distribution over categories is uneven and a number of categories are scarce, we choose the pairs belonging to the 10 largest classes to guarantee that each class has enough training samples; 186,577 of the 269,648 pairs are thus selected. In this dataset, each image is represented by a 500-dimensional SIFT codebook vector, and each text document is represented by a 1000-dimensional tag codebook vector.

4.2. Baseline methods and evaluation scheme

Baseline methods: We compare SCH with several existing works, which are taken as baseline methods in this paper. CCA [14] is a classical method for cross-modal retrieval, which finds a Hamming space by maximizing the correlation between different modalities. CRH [8] learns hashing functions by projecting samples far from zero, so that inter-modal similarity is effectively preserved. SCM [33] is a supervised hashing method that integrates semantic labels into the hash-learning phase and maximizes the semantic correlation between different modalities. LSSH [18] introduces sparse coding and matrix factorization to find low-dimensional latent semantic spaces for images and text respectively, and learns a linear mapping matrix to bridge the semantic spaces of the two modalities. CMFH [19] introduces collective matrix factorization to learn a consistent Hamming space for each modality. STMH [10] projects the learned multi-modal semantic features into a common subspace according to their semantic correlations, and generates hashing codes by determining whether a topic is contained in a text or an image.


Fig. 2. Precision–Recall curves of our SCH and compared methods for the cross-modal retrieval experiments on Wiki dataset with code length of 32 and 64 bits respectively. (a) and (c) show the results for the task of image querying texts. (b) and (d) show the results for the task of text querying images.

Fig. 3. One example of text-based image retrieval on the Wiki dataset using SCH and CMFH [19]. The query text is chosen from the geography class. To compare the performance of the two methods, the top 20 images retrieved in response to the query are presented above. The top two rows of images are obtained by SCH, and the bottom two rows by CMFH. A green box denotes that the semantic concept of the image is the same as the query, and a red box denotes that it is different from the query. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


Fig. 4. One example of image-based text retrieval on the Wiki dataset by SCH and CMFH [19]. We use the retrieved texts, their corresponding images and their semantic concepts to demonstrate the results. Retrieved results irrelevant to the query in semantic concept are marked with a red box, and retrieved results relevant to the query are marked with a green box. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Evaluation scheme: We evaluate the performance of our method with mAP (mean Average Precision) as proposed in [15], which is used in almost all multimedia retrieval algorithms. In addition, Precision–Recall curves, showing the precision at different recall rates, are reported on the Wiki dataset.

4.3. Experimental setting

By default, we set λ1 = 0.5, λ2 = 0.5, μ1 = 0.0013, μ2 = 0.0025, β1 = 1, β2 = 1 and η = 0.05 in the experiments. We set K = 30 and K = 80

for the Wiki dataset, and K = 50 and K = 200 for the NUS-WIDE dataset, respectively. The code length varies from p = 16 to p = 128 in the experiments. Since the initialization is selected randomly, we run SCH 10 times and average the results. The parameter sensitivity is discussed in the next subsection.

4.4. Experimental results

In the experiments, two tasks are conducted: using an image query to retrieve texts and vice versa; the method is easy to extend to more than two modalities. To verify the effectiveness of our method, a set of experiments is conducted to compare it with the baseline algorithms.

4.4.1. Results on the Wiki dataset

75% of the samples are randomly chosen as the training set and the remaining 25% as the test set. In addition, we find that the proposed optimization algorithm needs only about 10–20 iterations to converge; all results reported in this paper are obtained with 15 iterations. We compare the proposed SCH with the baseline methods using mAP with code lengths varying from 16 to 128 bits, as listed in Tables 1 and 2. From the results shown in Tables 1 and 2, we make three observations. (1) SCH significantly outperforms the existing schemes at different code lengths on both cross-modal retrieval tasks. Our method outperforms the baseline methods by 3.4–7.3% over

different code lengths for the task of image querying texts, and by 8.0–10.7% for the task of text querying images. The reasons are that NMF has the intrinsic power of effectively finding a subspace for high-dimensional features, and that not only inter-modal but also intra-modal consistency is effectively preserved in the shared semantic space. (2) The results show an upward trend in performance as the code length increases. This is reasonable, because more information is encoded in the hashing codes as the code length increases. (3) Furthermore, we find that SCH outperforms the baseline methods even when K is relatively small. The likely reason is that the semantic consistency is preserved efficiently and effectively by our method.

Fig. 2 shows the Precision–Recall curves of SCH compared with the baseline methods. From Fig. 2, we can see that SCH again outperforms the baseline algorithms on the two tasks, consistent with the results reported in the tables above. Moreover, two examples of image querying texts and text querying images on the Wiki dataset by SCH and CMFH are presented in Figs. 3 and 4. For the example of using text to query images, the query text is about parks from the geography class. From the results we can observe that most of the retrieved results belong to the same class as the query, and the images retrieved by SCH are more semantically correlated with the query than those retrieved by CMFH [19]. For the example of using an image to query texts, the query image is of a stadium from the sport class; the retrieved texts and their corresponding images are used to display the results. From Fig. 4 we find that the third and fifth texts retrieved by SCH differ in class from the query image, while the second, fourth and fifth texts retrieved by CMFH differ in class from the query image. This indicates that the performance of SCH is superior to that of CMFH. Furthermore, the results obtained by SCH are more visually consistent with the query image than those of CMFH.

In addition, a set of experiments is conducted with different parameters to test the parameter sensitivity; we utilize cross-validation to tune the parameters. First, the effect of K is examined. We set K = [10, 20, 30, 50, 80, 100], and the code length varies from 16 to 128 bits in these experiments. The results are shown in Fig. 5.


Fig. 5. mAP comparison on Wiki dataset with the parameter K varying from 10 to 100 and code length varying from 16 to 128 bits. (a) The results for the task of image querying texts. (b) The results for the task of text querying images.


Fig. 6. Performance on Wiki dataset by varying μ from 0.0001 to 1. (a) The results for the task of image querying texts. (b) The results for the task of text querying images.


Fig. 7. The performance on Wiki dataset using 32 bits hashing code. (a) The performance by varying β from 0 to 10. (b) The performance by varying η from 0.0001 to 1.

From Fig. 5, we can observe that a larger value of K leads to better results, since more information can be exploited in the training phase as K increases. However, the training cost also increases with K. Fortunately, we find that a relatively small K can achieve promising results, and SCH is not

sensitive to the value of K when K > 80. This makes SCH much more scalable than existing algorithms. We then test the effect of μ, varying it from 0.0001 to 1 with different hashing code lengths. The experimental results are shown in Fig. 6; we find that the performance is only mildly sensitive to this parameter. Finally, we find in the experiments that the results are not


sensitive to β and η; SCH achieves promising results when β ∈ [0.3, 5] and η ∈ [0.0005, 0.2], as shown in Fig. 7.

4.4.2. Results on the NUS-WIDE dataset

CRH and LSSH require too many resources to learn hashing functions on the full NUS-WIDE dataset, so we randomly select 20,000 pairs to train the hashing functions and then use them to generate hashing codes for the remaining samples. The performance of SCH and the baseline schemes is reported in Tables 3 and 4. From the results, we find that SCH achieves promising mAP performance; in more detail, SCH improves over the baseline methods by a maximum of 16.48%.

Table 3. Performance comparison on the NUS-WIDE dataset in terms of mAP for the task of text querying images (task: text to image).

Methods         p = 16    p = 32    p = 64    p = 128
CCA [14]        0.4362    0.4209    0.4235    0.4147
CRH [8]         0.4792    0.5077    0.4837    0.5024
SCM [33]        0.5348    0.5422    0.5736    0.6017
LSSH [18]       0.6217    0.6438    0.6902    0.6916
CMFH [19]       0.6593    0.6834    0.7105    0.7189
STMH [10]       0.6757    0.6968    0.7244    0.7314
SCH (K = 50)    0.6847    0.7196    0.7552    0.7602
SCH (K = 200)   0.6954    0.7406    0.7975    0.8244

Table 4. Performance comparison on the NUS-WIDE dataset in terms of mAP for the task of image querying texts (task: image to text).

Methods         p = 16    p = 32    p = 64    p = 128
CCA [14]        0.3912    0.3846    0.3901    0.3749
CRH [8]         0.4123    0.4056    0.4218    0.4015
SCM [33]        0.4842    0.4715    0.5283    0.5137
LSSH [18]       0.4934    0.5043    0.5182    0.5208
CMFH [19]       0.5527    0.5612    0.5751    0.5993
STMH [10]       0.5706    0.5975    0.6048    0.6319
SCH (K = 50)    0.6101    0.6167    0.6320    0.6472
SCH (K = 200)   0.6167    0.6450    0.7045    0.7261

Furthermore, a set of experiments on K is conducted with code lengths varying from 16 to 128 bits; the results are shown in Fig. 8(a) and (b). From Fig. 8, we can observe that, similarly to the Wiki dataset, a relatively small value of K achieves promising results, and SCH is not sensitive to the value of K when K > 200 on the NUS-WIDE dataset. To investigate the scalability of the different schemes, we compare the training time with CMFH [19] by varying the size of the training set from 5000 to 50,000 with 32-bit hashing codes. The experiment is performed on a server with an Intel(R) Xeon(R) E5-2650 CPU and 128 GB of memory. The results are shown in Table 5. From Table 5, we can observe that the training time of CMFH is lower than that of SCH on small training sets, because some terms of the time cost cannot be ignored when the training set is small, while the training time of CMFH is much higher than that of SCH on relatively large training sets. Hence, SCH can be used efficiently in large-scale applications.

Table 5. Training time (in seconds) on the NUS-WIDE dataset for varying sizes of the training set.

Methods      5000      10,000    20,000    50,000
SCH          283.52    294.75    316.38    352.17
CMFH [19]    91.40     168.55    459.53    1792.93

5. Conclusions

In this paper, we propose a Semantic Consistency Hashing (SCH) method for efficient similarity search over large-scale heterogeneous datasets. In particular, by leveraging both inter-modal and intra-modal semantic consistency, hashing functions are learned for the different modalities. An iterative updating scheme is then applied to efficiently reach a locally optimal solution, and the training time complexity is reduced to O(N). In the experiments, the results show significant improvements over existing cross-modal retrieval methods. Our future work includes utilizing more sophisticated learning schemes, such as kernel learning, to further improve the performance at an affordable training cost. Our scheme can also be easily extended to image classification, multi-label learning, annotation, and other learning domains.


Fig. 8. mAP comparison on NUS-WIDE dataset with the parameter K varying from 30 to 400 and code length varying from 16 to 128 bits. (a) The task of image querying texts. (b) The task of text querying images.


Acknowledgment

This work is supported by the Foundation for Innovative Research Groups of the NSFC (Grant no. 71421001), the National Natural Science Foundation of China (Grant nos. 61502073 and 61172109), the Fundamental Research Funds for the Central Universities (No. DUT14QY03) and the Open Projects Program of the National Laboratory of Pattern Recognition (No. 201407349).

References

[1] B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search, in: IEEE International Conference on Computer Vision, 2009, pp. 2130–2137.
[2] L. Zhang, Y. Zhang, X. Gu, J. Tang, Q. Tian, Scalable similarity search with topology preserving hashing, IEEE Trans. Image Process. 23 (7) (2014) 3025–3039.
[3] H. Fu, X. Kong, J. Lu, Large-scale image retrieval based on boosting iterative quantization hashing with query-adaptive reranking, Neurocomputing 122 (2013) 480–489.
[4] J. Ji, S. Yan, J. Li, G. Gao, Q. Tian, B. Zhang, Batch-orthogonal locality-sensitive hashing for angular similarity, IEEE Trans. Pattern Anal. Mach. Intell. 36 (10) (2014) 1963–1974.
[5] Z. Bodó, L. Csató, Linear spectral hashing, Neurocomputing 141 (2) (2014) 117–123.
[6] Y. Pan, T. Yao, H. Li, C.-W. Ngo, T. Mei, Semi-supervised hashing with semantic confidence for large scale visual search, ACM Spec. Interest Group Inf. Retr. (2015) 53–62.
[7] J.C. Caicedo, F.A. González, Multimodal fusion for image retrieval using matrix factorization, in: ACM International Conference on Multimedia Retrieval, 2012, pp. 56–63.
[8] Y. Zhen, D.-Y. Yeung, Co-regularized hashing for multimodal data, Neural Inf. Process. Syst. (2012) 1376–1384.
[9] Z. Lin, G. Ding, M. Hu, J. Wang, Semantics-preserving hashing for cross-view retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3864–3872.
[10] D. Wang, X. Gao, X. Wang, L. He, Semantic topic multimodal hashing for cross-media retrieval, in: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3890–3896.
[11] B. Wu, Q. Yang, W.-S. Zheng, Y. Wang, J. Wang, Quantized correlation hashing for fast cross-modal search, in: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI, 2015.
[12] J. Song, Y. Yang, Y. Yang, Z. Huang, H. Shen, Inter-media hashing for large-scale retrieval from heterogenous data sources, in: ACM International Conference on Management of Data, 2013, pp. 785–796.
[13] T. Mei, Y. Rui, S. Li, Q. Tian, Multimedia search reranking: a literature survey, ACM Comput. Surv. (CSUR) 46 (3) (2014) 38.
[14] Y. Gong, S. Lazebnik, Iterative quantization: a procrustean approach to learning binary codes, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 817–824.
[15] Y. Zhen, D.-Y. Yeung, A probabilistic model for multimodal hash function learning, in: ACM Conference on Knowledge Discovery and Data Mining, 2012, pp. 940–948.
[16] J. Masci, M.M. Bronstein, A. Bronstein, J. Schmidhuber, Multimodal similarity-preserving hashing, IEEE Trans. Pattern Anal. Mach. Intell. 36 (4) (2014) 824–830.
[17] Y.T. Zhuang, Y.F. Wang, F. Wu, Y. Zhang, W.M. Lu, Supervised coupled dictionary learning with group structures for multi-modal retrieval, in: AAAI Conference on Artificial Intelligence, 2013, pp. 1070–1076.
[18] J. Zhou, G. Ding, Y. Guo, Latent semantic sparse hashing for cross-modal similarity search, ACM Spec. Interest Group Inf. Retr. (2014) 415–424.
[19] G. Ding, Y. Guo, J. Zhou, Collective matrix factorization hashing for multimodal data, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2083–2090.
[20] X. He, P. Niyogi, Locality preserving projections, in: International Conference on Machine Learning, vol. 16, 2003, pp. 153–160.
[21] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[22] D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
[23] D. Cai, X. He, X. Wu, J. Han, Non-negative matrix factorization on manifold, in: International Conference on Data Mining, 2008, pp. 63–72.
[24] M. Heiler, C. Schnörr, Learning sparse representations by non-negative matrix factorization and sequential cone programming, J. Mach. Learn. Res. 7 (2006) 1385–1407.
[25] W. Xu, X. Liu, Y. Gong, Document clustering based on non-negative matrix factorization, ACM Spec. Interest Group Inf. Retr. (2003) 267–273.
[26] F. Shang, L. Jiao, F. Wang, Graph dual regularization non-negative matrix factorization for co-clustering, Pattern Recognit. 45 (6) (2012) 2237–2250.
[27] Y. Yang, H.T. Shen, F. Nie, R. Ji, X. Zhou, Nonnegative spectral clustering with discriminative regularization, in: AAAI Conference on Artificial Intelligence, 2011, pp. 555–560.
[28] F. Pompili, N. Gillis, P.-A. Absil, F. Glineur, Two algorithms for orthogonal nonnegative matrix factorization with application to clustering, Neurocomputing 141 (2014) 15–25.
[29] A. Kumar, V. Sindhwani, P. Kambadur, Fast conical hull algorithms for near-separable non-negative matrix factorization, in: International Conference on Machine Learning, 2013, pp. 231–239.
[30] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, Neural Inf. Process. Syst. (2000) 556–562.
[31] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399–2434.
[32] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, 2004.
[33] D. Zhang, W.-J. Li, Large-scale supervised multimodal hashing with semantic correlation maximization, in: AAAI Conference on Artificial Intelligence, 2014, pp. 2177–2183.
[34] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G.R. Lanckriet, R. Levy, N. Vasconcelos, A new approach to cross-modal multimedia retrieval, ACM Multimed. (2010) 160–251.
[35] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, NUS-WIDE: a real-world web image database from National University of Singapore, in: ACM Conference on Image and Video Retrieval, 2009, pp. 48–56.
[36] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, Neural Inf. Process. Syst. 3 (2003) 993–1022.

Tao Yao received his Master's degree from Wuhan University of Technology, China, in 2006. He is currently pursuing his Ph.D. degree in the School of Information and Communication Engineering at Dalian University of Technology, China. Since 2006, he has been a Lecturer in the School of Information and Electrical Engineering at Ludong University, China. His research interests include multimedia retrieval, computer vision and machine learning.

Xiangwei Kong received her Ph.D. degree in Management Science and Engineering from Dalian University of Technology, China, in 2003. From 2006 to 2007, she was a visiting researcher in Department of Computer Science at Purdue University, USA. She is currently a professor in the School of Information and Communication Engineering at Dalian University of Technology, China. Her research interests include digital image processing and recognition, multimedia information security, digital media forensics, image retrieval and mining, multisource information fusion, knowledge management and business intelligence.

Haiyan Fu received her Ph.D. from Dalian University of Technology, China, in 2014. She is currently an associate professor in the School of Information and Communication Engineering at Dalian University of Technology, China. Her research interests are in the areas of image retrieval and computer vision.

Qi Tian received the B.E. degree in electronic engineering from Tsinghua University, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University in 1996 and the Ph.D. degree in electrical and computer engineering from the University of Illinois, Urbana-Champaign in 2002. He is currently a Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008–2009. His research interests include multimedia information retrieval and computer vision.