
Global and Local Semantics-Preserving Based Deep Hashing for Cross-Modal Retrieval

Lei Ma, Hongliang Li, Fanman Meng, Qingbo Wu, King Ngi Ngan
School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, PR China

Accepted manuscript, Neurocomputing (2018). doi: 10.1016/j.neucom.2018.05.052
Received 17 October 2017; revised 3 May 2018; accepted 17 May 2018

Abstract

Cross-modal hashing methods map similar data entities from heterogeneous data sources to binary codes with small Hamming distances. However, most existing cross-modal hashing methods learn the hash codes from hand-crafted features, which cannot generate optimal hash codes or achieve satisfactory performance. Deep cross-modal hashing methods integrate feature learning and hash coding into an end-to-end learning framework and have achieved promising results. However, these deep cross-modal hashing methods do not well preserve the discriminative ability and the global multilevel similarity during the hash learning procedure. In this paper, we propose a global and local semantics-preserving based deep hashing method for cross-modal retrieval. More specifically, a large margin is enforced between similar hash codes and dissimilar hash codes from an inter-modal view to learn discriminative hash codes, so the learned hash codes can well preserve the local semantic structure. Subsequently, supervised information with the global multilevel similarity is introduced to learn semantics-preserving hash codes for each intra-modal view; as a consequence, the global semantic structure can be preserved in the hash codes. Furthermore, a consistent regularization constraint is added to generate unified hash codes. Ultimately, the feature learning procedure and the hash coding procedure are integrated into an end-to-end learning framework. To verify the effectiveness of the proposed method, extensive experiments are conducted on several datasets, and the experimental results demonstrate that the proposed method achieves superior performance.

Keywords: Deep learning, metric learning, semantic preserving, cross-modal hashing.

1. Introduction

Nowadays, the dramatic growth of multimedia data from internet clicks and mobile device usage has posed a great challenge to data storage and information indexing. With the benefits of high retrieval efficiency and low storage cost, hashing technology has attracted growing attention during the past few years. Hashing methods aim to map data points to compact binary codes while preserving the similarity in the original space. The storage cost can be reduced dramatically with the binary representation, and constant or sub-linear search speed can be achieved with fast Hamming distance computation (bit-wise XOR and bit-count operations).
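As a minimal illustration of these bit-wise operations (our own sketch, not code from the paper), the Hamming distance between two integer-packed binary codes can be computed with a single XOR followed by a bit count:

```python
# Minimal illustration (our own sketch, not code from the paper): the Hamming
# distance between two integer-packed binary codes via XOR and a bit count.
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which the two codes differ."""
    return bin(code_a ^ code_b).count("1")

a = 0b10110100   # two toy 8-bit codes
b = 0b10011110
print(hamming_distance(a, b))   # -> 3
```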

Most learning to hash methods focus on unimodal hashing, which requires queries and database entries to have a homogeneous feature space [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. However, in real-world scenarios, data are often composed of multiple modalities. For example, an image on a website is usually surrounded by tags, textual descriptions and other contents. Cross-modal retrieval aims to take a query from one modality and retrieve relevant data in another modality. To improve retrieval efficiency and speed, several cross-modal hashing methods have been widely explored.

Existing cross-modal hashing methods can be roughly divided into two main sub-categories: unsupervised methods and supervised methods. The unsupervised methods only utilize the original multi-modal features to learn hash functions, and the representative methods include CHMIS [16], SM2H [17], LSSH [18], CMFH [19], ACQ [20], STMH [21], MGCMH [22], CCQ [23] and CV-DMH [24]. The basic idea of these unsupervised cross-modal hashing methods is to discover the latent correlations between different modalities. The widely used techniques include canonical correlation analysis [20], manifold learning [16, 17, 21, 22, 24], dictionary learning [17, 18], matrix factorization [18, 19, 21, 24] and vector quantization [23]. Due to the lack of manual annotations, unsupervised cross-modal hashing methods usually fail to achieve satisfactory performance.

For supervised cross-modal hashing methods, semantic labels are incorporated into learning hash functions to improve cross-modal retrieval performance. The representative supervised cross-modal hashing methods include: 1) traditional cross-modal hashing methods such as CH [25], SCM [26], QCH [27], SMFH [28], MDBE [29], SePH [30], RoPH [31], DCH [32], LSRH [33] and CSDH [34]; 2) deep cross-modal hashing methods such as DCMH [35], PRDH [36], DVSH [37], CDQ [38], CHN [39], DSH [40] and TDH [41]. The supervised cross-modal hashing methods incorporate the manual annotations into the loss function to model the semantic correlations across different modalities.

For traditional supervised cross-modal hashing methods, the entire learning procedure can be separated into two independent steps, i.e., feature extraction and hash-code learning. Although these methods can achieve promising retrieval performance, they usually fail to generate optimal hash codes. The reasons are two-fold: 1) their feature extraction procedure is independent of the hash-code learning procedure; 2) the hash functions to be learned are usually linear, which cannot effectively preserve the structure of the learned hash codes. On the contrary, deep cross-modal hashing methods integrate the feature learning procedure and the hash-code learning procedure into an end-to-end learning framework with deep neural networks. In addition, the nonlinear activations of deep neural networks can in theory help to fit arbitrarily complex functions. Generally speaking, the end-to-end deep learning architecture for cross-modal hashing consists of two neural networks, one for each modality. The image modality usually utilizes a convolutional neural network (CNN) [35, 36, 37, 38], and the text modality usually adopts a multilayer perceptron (MLP) [35, 36, 38] or a recurrent neural network (RNN) [37].

This end-to-end deep learning architecture can map images and texts into a common Hamming space in which cross-modal retrieval is conducted. However, existing deep cross-modal hashing methods treat semantically similar instances equally while ignoring the different similarity levels between them in the case of multiple labels, where the similarity level is determined by the number of common labels shared by pairwise similar instances. Consequently, the above deep cross-modal hashing methods cannot well handle the multi-label cross-modal retrieval task based on multilevel similarity measures (such as very similar, normally similar and dissimilar) [42]. In addition, existing deep cross-modal hashing methods can only capture the local semantic structure while ignoring the global semantic structure. Obviously, the local semantic structure is insufficient to represent the semantic structure of cross-modal data.

M

70

utilize a metric learning method based on Hamming distance to preserve a

ED

large margin between similar data points and dissimilar data points across different modalities. Through the metric learning method, we can preserve the local semantic structure into the hash codes across different modalities. In addition, in order to preserve the global multilevel semantic structure in hash

PT

75

codes, we transform the similarity levels into a probability distribution and

CE

utilize this probability distribution to supervise the intra-modal hash learning. Subsequently, in order to further enhance the performance of cross-modal retrieval, a consistent regularization term is introduced to guarantee that the resulting hash codes are as consistent as possible across different modalities.
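The following minimal sketch (our own illustration, independent of the exact formulation given later in Section 3) shows one way such similarity levels can be turned into a probability distribution: the numbers of shared labels are normalized over all instance pairs.

```python
import numpy as np

def label_probability_matrix(Z):
    """Normalize pairwise similarity levels (numbers of shared labels) into a
    joint probability distribution over all off-diagonal instance pairs."""
    A = (Z @ Z.T).astype(float)   # A[i, j] = number of common labels
    np.fill_diagonal(A, 0.0)      # only pairs with i != j are modeled
    return A / A.sum()

# Hypothetical label matrix: 4 instances, 3 candidate labels.
Z = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
P = label_probability_matrix(Z)
print(P.sum())   # -> 1.0
```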

Finally, we integrate the local inter-modal semantic structure preserving term, the global intra-modal semantic structure preserving term and the consistent regularization term into an end-to-end learning framework. The entire deep network model is jointly trained via an efficient optimization algorithm. Extensive experiments conducted on various cross-modal datasets, together with comparisons with state-of-the-art approaches, demonstrate the effectiveness of the proposed method.

Figure 1: Overview of the proposed global and local semantics-preserving based deep hashing method for cross-modal retrieval. The network consists of two encoders, an image encoder and a text encoder. We integrate the local semantic structure preserving (metric learning) term, the global semantic structure preserving term and a consistent regularization term into an end-to-end learning framework to improve the discriminative ability of the hash codes while preserving the global multilevel semantic structure. Different shapes and colors represent dissimilar data points and data points from different modalities, respectively. P, Qx and Qy are the joint probability distributions derived from the ground-truth labels, the image hash codes and the text hash codes, respectively. Please refer to the text for more details.

The rest of this paper is organized as follows. We review related works in section 2. The proposed method is elaborated in section 3, including the deep architecture, the local semantic similarity preserving for capturing inter-modal correlations, the global semantic similarity preserving for capturing intra-modal correlations, and the regularization loss. The optimization algorithm is introduced in section 4. Extensive experimental evaluations are provided in section 5. Finally, we draw a conclusion in section 6.

2. Related Works

Recently, a variety of cross-modal retrieval methods have been proposed, including real-valued cross-modal retrieval methods, quantization-based cross-modal retrieval methods and cross-modal hashing methods. A comprehensive survey on cross-modal retrieval can be found in [43].

2.1. Real-Valued Cross-Modal Retrieval Methods

The key idea of real-valued cross-modal retrieval is to map multimedia data into a common subspace where cross-modal multimedia retrieval can be performed. Recently, many real-valued cross-modal retrieval methods have been proposed. For example, Rasiwasia et al. [44] learn a latent space between two modalities with canonical correlation analysis (CCA). Deng et al. [45] learn the cross-modal embedding space based on discriminative dictionary learning with common label alignment. Liong et al. [46] design two neural networks to nonlinearly map the multimedia data from two different modalities into a shared feature subspace. However, these methods often require substantial memory resources for cross-modal retrieval, and their low query efficiency limits their scalability. Therefore, they are not suitable for large-scale search.

2.2. Quantization-Based Cross-Modal Retrieval Methods

Recently, quantization techniques [47, 48, 49] use quantizers to approximate the original features and have obtained good results for the single-modal similarity retrieval task [50, 51, 52]. Consequently, some quantization-based cross-modal hashing methods have been proposed. For example, Irie [20], Xie [22], Liu [25], Wu [27] and Jiang [35] learn hash functions and hash codes by minimizing the binary quantization error. Long et al. [23] propose a composite correlation quantization model to preserve both intra-modal similarity and inter-modal correlation. Zhang et al. [53] learn the quantizers for both modalities jointly by aligning the quantized representations for each pair of image and text in the same article. Yang [54] and Cao [38] propose deep quantization methods for cross-modal similarity retrieval, which adopt additive quantization [48] with label alignment and composite quantization [49], respectively, to learn the quantizers for both modalities using two neural networks. However, these quantization-based cross-modal hashing methods [23, 53, 54, 38] tend to be slightly less efficient than Hamming search over binary codes [55].

2.3. Cross-Modal Hashing Methods

Unsupervised cross-modal hashing methods [17, 19, 22, 56, 57] learn the hash codes using the original multi-modal features without any supervision information. For example, Wu et al. [17] learn sparse codesets by jointly optimizing the multi-modal dictionary, sparse representations and a hyper-graph based manifold regularization. Ding et al. [19] employ collective matrix factorization to learn unified binary codes for different modalities. Xie et al. [22] learn unified hash codes by constructing multiple homogeneous manifolds, one for each modality. Liu et al. [56] propose a query-adaptive rank fusion method over multiple tables with multiple views. In real applications, unsupervised cross-modal hashing methods can hardly achieve satisfactory retrieval performance.

ED

140

ing or hash-code learning. The supervised information for modeling semantic correlations in existing cross-modal hashing can be given in four different forms

PT

include point-wise supervision [29, 32], pair-wise supervision [25, 26, 27, 28, 33, 34, 35, 36, 37, 38, 39, 40], triplet-wise supervision [31] and structured supervision 145

[30].

CE

The point-wise supervision usually formulates the hash learning as a linear

classification problem. For example, Wang et al. [29] simultaneously learn hash

AC

codes for supervised classification and preserves category-specific features for hash codes. Xu et al. [32] learn the modal-specific hash functions via the ridge

150

penalty and generates unified hash codes for linear classification. Although the semantic labels are introduced, the semantic correlation between different instances is not fully exploited. Therefore, the semantic correlation will not be effectively reflected in the resulting hash codes. 7

ACCEPTED MANUSCRIPT

The pair-wise supervision usually formulates the hash learning as a pair155

wise fitting problem [26, 30, 31, 40], metric learning problem [28, 27, 37, 39] and pair-wise classification problem [33, 34, 35, 36, 38]. The pair-wise fitting

CR IP T

problem aims to fit the groundtruth similarities with the similarities of hash codes. For example, Zhang [26] and Liu [40] fit the semantic similarity matrix with the learned hash codes based on the minimum mean square error. The 160

metric learning problem aims to preserve the distance between similar hash

codes close and dissimilar hash codes far away in the Hamming space, where different distance metrics can be used such as Hamming distance, Euclidean

AN US

distance, cosine distance and so on. For example, Liu et al. [28] preserve the semantic similarities of hash codes based on Euclidean distance. Wu [27] 165

and Cao [37, 39] utilize the cosine distance between hash codes to measure the similarity in the Hamming space. The pair-wise classification problem aims to minimize the pair-wise classification error. For example, Liu et al. [34] learn hash functions by minimizing the weighted classification error for each bit with

170

M

a boosting strategy. Jiang [35], Yang [36], and Cao [38] maximize cross-modal correlations by optimizing the likelihood function based on pairwise constraints.

ED

The pair-wise supervised methods usually consider the local semantic structure, since they mainly focus on modeling the relationship between individual training instances. Therefore the global semantic structure such as the ranking relation

175

PT

and the distribution of the semantic relationships between training instances are ignored.

CE

The triplet-wise supervision [31, 41] usually formulates the hash learning as a ranking preserving problem. For example, Ding et al. [31] formulate the problem of triplet ranking in Hamming space as the binary regression problem.

AC

Structured supervision [30] formulates the hash learning as the preservation of

180

the structure of semantic similarity distribution. For example, Lin et al. [30] transform the semantic affinities into a probability distribution, and fits it with another one derived from pair-wise Hamming distances by minimizing their Kullback-Leibler divergence. Compared to the triplet supervision, structured supervision can sufficiently capture the distribution of the semantic relation8

ACCEPTED MANUSCRIPT

185

ships between training instances by introducing the global semantic structure. However, the distribution of the semantic relationships between training instances cannot completely reveal the local discriminating semantic structure of

CR IP T

the multimedia data. In this paper, we incorporate the local semantic structure and the global se190

mantic structure into the cross-modal hash learning. In addition, inspired by the

success of quantization-based hashing methods, we introduce a consistent quantization loss to generate unified hash codes for both modalities of any instance. Finally, the local inter-modal semantic structure preserving term, the global

195

AN US

intra-modal semantic structure preserving term and the consistent regularization term are integrated into an end-to-end learning framework. Experimental results demonstrate the promising effectiveness and efficiency of the proposed GLSP.

M

3. Proposed Method

3.1. Notation and Problem Formulation In this paper, the bold uppercase letter like S and bold lowercase letter like

ED

200

z are used to denote matrix and vector. Sij denotes the element at the position (i, j) of matrix S. Si∗ denotes the elements of matrix S at the i-th row, and S∗j

PT

denotes the elements of matrix S at the j-th column. Moreover, kSkF is used to denote the Frobenius norm of matrix S. sign(·) denotes an element-wise sign 205

function where sign(x) = 1 if x ≥ 0 and sign(x) = −1 otherwise.

CE

Assume that there are n pairs of image-text instances O = {o_i}_{i=1}^n, and each instance o_i = (x_i, y_i) has a class label vector z_i = [z_{i1}, z_{i2}, ..., z_{ic}] ∈ R^{c×1}, where the set X = {x_i}_{i=1}^n denotes the image modality, Y = {y_i}_{i=1}^n denotes the text modality, and c is the number of categories. z_{ij} is equal to 1 if the j-th label is relevant to the instance o_i and 0 otherwise. In addition, we define a similarity matrix S ∈ {0, 1}^{n×n} over the training instances, where S_ij = 1 if the instances o_i and o_j are similar and S_ij = 0 otherwise. Here, the definition of similar or dissimilar instances is based on their class labels {z_i}_{i=1}^n: if two instances o_i and o_j share at least one label, they are regarded as similar; otherwise, o_i and o_j are dissimilar. The similarity matrix S also builds a neighbor graph between instances in the semantic space. In other words, o_j is a k-nearest neighbor of o_i if S_ij = 1; otherwise, o_j is not a k-nearest neighbor of o_i.

{−1, 1}L and a text hash function hy (y) ∈ {−1, 1}L , where L is the length of (x)

(y)

= h(x) (xi ) and bi

hash code. The binary codes bi

= h(y) (yi ) can be gen-

erated for query and database instances with the corresponding hash function. (x)

(x)

(y)

dH (bi , bj ) =

(y)

and bj

is defined as follows:

AN US

The Hamming distance between bi

1 (x) L 1 (x) T (y) (y) kb − bj k2F = − bi bj 4 i 2 2

(1)

The proposed cross-modal hashing method is based on two assumptions: 1) if inter-modal hash codes are discriminative enough, similar hash codes and dissimilar hash codes from an inter-modal view will be separated by a large

M

margin; 2) if intra-modal hash codes can preserve the structure of the semantic space well, the difference between the probability distribution in the semantic

ED

space and the probability distribution of each modality in the Hamming space will be small. Based on the above assumptions, the entire objective function

PT

can be given as follows:

J = J xy + αJ xx + βJ yy + γR

(2)

CE

where J xy is the local semantic structure preserving term for capturing inter-

220

modal correlations, J xx and J yy are the global semantic structure preserving

term for capturing intra-modal correlations, and R is a regularization term to

AC

generate the unified hash codes for both modalities. α, β and γ are tradeoff hyperparameters to balance the impacts of each term. 3.2. Deep Architecture

225

The deep architecture contains two encoders, i.e., an image encoder and a text encoder, which are same as the ones in [35]. The image encoder consists of 10

ACCEPTED MANUSCRIPT

eight layers the VGG-F [58] network where the first seven layers are from the VGG-F model including conv1 − conv5 and f c6 − f c7, the eight layer is a fullyconnected layer of L dimension. The text encoder is a multilayer perceptrons (MLP) which consists of two fully-connected layers with the output dimensions

CR IP T

230

of 8192 and L respectively. The activation functions for the first layer and the

second layer are ReLU and the identity function respectively. Through these

two deep neural networks, we can embed image and text feature into a common space.
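The following NumPy sketch (our own illustration) mirrors the text encoder's forward pass described above; the 1386-dimensional bag-of-words input follows the MIRFlickr-25K setup used later in the paper, and the randomly initialized weights are placeholders rather than trained parameters.

```python
import numpy as np

# Sizes follow the description above; the 1386-d bag-of-words input matches the
# MIRFlickr-25K setup used later in the paper. Weights are random placeholders.
bow_dim, hidden_dim, L = 1386, 8192, 64
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(hidden_dim, bow_dim)); b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.01, size=(L, hidden_dim));       b2 = np.zeros(L)

def text_encoder(bow_vector):
    h = np.maximum(0.0, W1 @ bow_vector + b1)   # fc1 + ReLU -> 8192-d
    return W2 @ h + b2                          # fc2, identity activation -> L-d

real_valued_output = text_encoder(rng.random(bow_dim))
hash_code = np.where(real_valued_output >= 0, 1, -1)   # binarize to {-1, +1}
print(hash_code.shape)   # -> (64,)
```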

3.3. Local Semantic Structure Preserving for Capturing Inter-Modal Correlations

Local semantic structure preserving focuses on modeling the neighborhood relationship between individual data points. In other words, it refers to how to model the similarity relations in the Hamming space with the neighborhood graph S. Specifically, we expect the neighbor image-text pairs to have small

M

Hamming distances, while the non-neighbor image-text pairs to have large Hamming distances. Inspired by the successful application of metric learning in face

ED

verification [59], object recognition [60], image set classification [61], cross-modal matching [46], we would like to introduce the metric learning method into crossmodal hashing. However, most metric learning methods use Euclidean distance

PT

or cosine distance to measure the similarities between different data points. We will extend the metric learning method to Hamming space and reformulate the local semantic structure preserving as a metric learning problem. Specifically, (x)

CE

for each hash code bi

(the green circle in Fig. 1) in the image modality, we (y)

expect the hash code of similar instance bj

(the blue circle in Fig. 1) in the

AC

text modality to fall within a given Hamming radius µ1 − τ1 (µ1 > τ1 > 0) of (x)

(y)

bi , and the hash code of dissimilar instance bk

(the blue rectangle in Fig.

1) in the text modality to fall out of a given Hamming distance radius µ1 + τ1 . Therefore, a large margin 2τ1 will be generated between dissimilar instances

from the same modality and a large margin µ1 + τ1 will be generated between dissimilar instances from the different modalities as shown in Fig. 1. We define 11

ACCEPTED MANUSCRIPT

the following loss for penalizing the violation of the distance constraints above:

=

  δ(dH (bi(x) , b(y) j ) ≥ µ1 − τ1 ), if Sij = 1 

(x)

(y)

(3)

δ(dH (bi , bj ) ≤ µ1 + τ1 ), otherwise.

CR IP T

(x) (y) ϕ(bi , bj , Sij )

where δ(·) is an indicator function which returns 1 if the condition within the

brackets is satisfied and 0 otherwise. In Eqn. (3), a const penalty is given for

the pairwise samples which violate the distance constraints. Then the proposed large margin inter-modal hashing loss is defined as: =

n X n X

(x)

(y)

ϕ(bi , bj , Sij ).

i=1 j=1

(4)

AN US

J

xy

3.4. Global Semantic Structure Preserving for Capturing Intra-Modal Correlations

Global semantic structure preserving focuses on the preservation of the global semantic distribution. Meanwhile, the global semantic structure should

M

consider the multilevel similarity information between instances. In addition, the multi-modal instances are usually associated with multiple labels. To pre-

ED

serve such multilevel similarity between hash codes, [42] utilizes an adaptive weighted triplet loss. However, such a triplet loss cannot make full use of the global multilevel similarity information. In contrast, we utilize the entire seman-

PT

tic affinity matrix [30] of training instances as supervised information to learn the global semantic preserving hashing codes for each modality. Specifically, we define the semantic affinity matrix A ∈ Rn×n as the number of common labels

AC

CE

between two data points. The element Aij is given as follows: Aij = zTi zj .

(5)

Therefore, we can compute a joint probability distribution P in the semantic

space from the semantic affinity matrix A. The element pij is derived as follows: pij = Pn

i=1

A Pnij

j=1,j6=i

12

Aij

.

(6)

ACCEPTED MANUSCRIPT

240

Because we are only interested in modeling pairwise similarities, the value of pii Pn P is set to zero. Thus i=1 j=1,j6=i pij = 1.

In the low-dimensional Hamming space, we can utilize the Hamming distance

CR IP T

to derive the probability distributions Qx and Qy for the image modality and the

x text modality respectively. For example, we define qij as the joint probability

distribution for bxi and bxj in the image modality as follows: x

x

e−dH (bi ,bj ) . Pn x −dH (bx k ,bm ) k=1 m=1,m6=k e

x qij = Pn

(7)

x The value of qij determines how similar the two instances bxi and bxj are. Again,

AN US

x since we are only interested in modeling pairwise similarities, the value of qii is y set to zero. Similarly, we can define qij as follows: y

y

e−dH (bi ,bj ) . Pn y −dH (by k ,bm ) k=1 m=1,m6=k e

y qij = Pn

(8)

y where the value of qii is set to zero. In order to preserve the global multilevel

semantic affinities in the resulting hash codes of each modality, we propose to

M

minimize the Kullback-Leibler divergence between the joint probability distribution P in the semantic space and the joint probability distribution Qx (Qy )

ED

in the Hamming space:

PT

J xx = KL(P k Qx ) = J yy = KL(P k Qy ) =

n n X X

pij log

pij x qij

(9)

n n X X

pij log

pij y qij

(10)

i=1 j=1,j6=i

i=1 j=1,j6=i

CE

3.5. Consistent Regularization Term Considering the semantic consistency across different modalities, the pro-

AC

posed method tends to generate unified hash codes for both modalities of any instance. Therefore, we introduce the following regularization term into the

245

objective function. R = kB − Bx k2F + kB − By k2F where B is the unified hash codes for both modalities. 13

(11)

ACCEPTED MANUSCRIPT

4. Optimization The objective function involves an indicator function and discrete variables, where these factors lead to a discontinuous and non-convex optimization prob-

CR IP T

lem. The optimization problem in Eqn. (2) is difficult to solve directly. There-

fore, we need to develop an efficient optimization algorithm to optimize the objective function. Firstly, we substitute a convex surrogate for the indicator

function in Eqn. (3). Specifically, we replace the indicator function δ(x ≥ 0)

with the logistic loss `(x) = log(1 + ex ). The main reason for choosing this surrogate is that this function is continuous, monotone increasing and differ-

AN US

entiable everywhere. In practical applications, we adopt its equivalent transformation `(x) = x − log(σ(x)) in order to avoid numerical overflow, where σ(x) = 1/(1 + e−x ). Therefore, we can approximate the Eqn. (3) as follows: n n X X i=1 j=1

n X n X i=1 j=1 n X n X

Sij δ(

L 1 (x) T (y) − µ1 + τ1 − bi bj ≥ 0)+ 2 2

(1 − Sij )δ(µ1 + τ1 −

M

J xy =

L 1 (x) T (y) + bi bj ≥ 0) 2 2 (12)

ED

1 (x) T (y) ≈ Sij `(µ + τ − bi bj )+ 2 i=1 j=1

PT

n X n X 1 (x) T (y) (1 − Sij )`(−µ + τ + bi bj ). 2 i=1 j=1

where µ = Bx∗i

and

By∗i

L 2

− µ1 and τ = τ1 . Secondly, we approximate the hash codes

with the real-valued output of the corresponding neural network.

CE

Specifically, we replace the hash codes Bx∗i and By∗i with f (xi ; θx ) and g(yi ; θy )

AC

y x respectively. Therefore, J xy in Eqn. (12), qij in Eqn. (7), qij in Eqn. (8), and

14

ACCEPTED MANUSCRIPT

R in Eqn. (11) can be rewritten as: J xy =

n X n X

1 Sij `(µ + τ − FT∗i G∗j )+ 2 i=1 j=1

T

1

x qij = Pn

k=1

e 2 F∗i F∗j Pn m=1,m6=k

1

T

e 2 F∗k F∗m

.

T

1

e 2 G∗i G∗j . Pn 1 T 2 G∗k G∗m k=1 m=1,m6=k e

y = Pn qij

AN US

R = kB − Fk2F + kB − Gk2F .

(13)

CR IP T

n X n X 1 (1 − Sij )`(−µ + τ + FT∗i G∗j ). 2 i=1 j=1

(14)

(15) (16)

The optimization problem of Eqn. (2) can be solved by alternating optimization of θx , θy and B. These parameters can be optimized one by one with other 250

parameters fixed. The entire alternating procedure is illustrated in Algorithm 1. The detailed step by step derivations are given below.

M

4.1. Update B with θx and θy fixed

ED

If we fix θx and θy , the optimization subproblem can be rewritten as follows: min kB − Fk2F + kB − Gk2F = −2Tr(BT V) + const B

(17)

PT

s.t. B ∈ {−1, 1}c×n

where V = F + G, and const = 2nc + kFk2F + kGk2F . We can update B with

CE

the sign of V as

B = sign(V) = sign(F + G).

(18)

AC

4.2. Update θx with θy and B fixed Since θy and B are fixed, we can utilize stochastic gradient descent (SGD)

to update the neural network parameter θx of the image modality with the back-propagation (BP) algorithm. More specifically, we randomly sample a mini-batch of points from the image modality, and derive the gradients of the objective function w.r.t. the real-valued output of the neural network for image

15

ACCEPTED MANUSCRIPT

CR IP T

Algorithm 1 GLSP algorithm Input: The image set X = {xi }ni=1 , text set Y = {yi }ni=1 , class labels {zi }ni=1 , cross-modal pairwise similarity matrix S, the parameters α, β, γ, µ, τ .

Output: Binary code matrix B, parameters θx and θy of the deep neural networks for the image modality and the text modality. Initialization:

Initialize network parameters θx and θy , mini-batch size Nx = Ny = 128,

1:

repeat

AN US

and the iteration number tx = dn/Nx e, ty = dn/Ny e; 2:

Update B according to Eqn. (18);

3:

for iter = 1, 2, . . . , tx do

4:

Randomly sample a mini-batch of Nx images from X;

5:

For each sampled image xi in the mini-batch, calculate the output

M

x f (xi ; θx ), update the matrix F∗i = f (xi ; θx ) and the probability qij

according to Eqn. (14);

Backpropagate the gradients according to Eqn. (19) and update the

ED

6:

network parameter θx . end for

8:

for iter = 1, 2, . . . , ty do

9: 10:

PT

7:

Randomly sample a mini-batch of Ny texts from Y;

For each sampled text yi in the mini-batch, calculate the output

CE

y g(yi ; θy ), update the matrix G∗i = g(yi ; θy ) and the probability qij

according to Eqn. (15);

AC

11:

12: 13:

Backpropagate the gradients according to Eqn. (20) and update the network parameter θy .

end for until convergence or reaching the maximum number of iterations.

16

ACCEPTED MANUSCRIPT

modality. Then we can compute

∂J ∂θx

with the chain rule. Finally, stochastic

gradient descent is performed to update the neural network parameter θx of the image modality. In particular, for each real-valued output F∗i of the neural ∂J ∂F∗i

∂J ∂Jxy ∂Jxx ∂R = +α +γ ∂F∗i ∂F∗i ∂F∗i ∂F∗i where

as follows:

CR IP T

network for image modality, we compute the gradient

(19)

n X ∂Jxy 1 =− Sij σ(µ + τ − FT∗i G∗j )G∗j + ∂F∗i 2 j=1 n X

AN US

1 (1 − Sij )σ(−µ + τ + FT∗i G∗j )G∗j 2 j=1

∂Jxx x = (qi∗ − pi∗ )FT ∂F∗i ∂R = 2(F∗i − B∗i ) ∂F∗i

(20) (21) (22)

M

4.3. Update θy with θx and B fixed

When θx and B are fixed, we use SGD to update the neural network pa-

255

ED

rameter θy of the text modality with the back-propagation (BP) algorithm. More specifically, for each real-valued output G∗i of the neural network for text

PT

modality, we compute the gradient

∂J ∂G∗i

as follows:

∂J ∂Jxy ∂Jxx ∂R = +α +γ ∂G∗i ∂G∗i ∂G∗i ∂G∗i

(23)

AC

CE

where

n X 1 ∂Jxy =− Sij σ(µ + τ − GT∗i F∗j )F∗j + ∂G∗i 2 j=1 n X

1 (1 − Sij )σ(−µ + τ + GT∗i F∗j )F∗j 2 j=1

∂Jxx y = (qi∗ − pi∗ )GT ∂G∗i ∂R = 2(G∗i − B∗i ) ∂G∗i

(24) (25) (26)

17

ACCEPTED MANUSCRIPT

4.4. Implementation Tricks In order to take full advantage of the whole training set, the real-valued

260

outputs of the neural networks for image modality and text modality are stored

CR IP T

in two matrices (i.e., F and G) respectively. In the training procedure, we

need to divide the whole training set into mini-batches. Then in each iteration, we randomly feed a mini-batch of points to the neural network and utilize the 265

real-valued outputs of the neural network to update the corresponding matrix. According to Algorithm 1, the corresponding parameters can be updated in each

iteration. It is worth noting that nT pairwise data points are incorporated into

AN US

the updating of the neural network in each iteration, where T is the batch size

and n is the number of training instances. Although it takes a slightly higher 270

computation and memory cost, the introduction of the whole training instances can boost the proposed method more robust against noise and outliers. y x The computation qij and qij in Eqn. (14) and Eqn. (15) involves all pairwise

data points in each iteration. Thus there are a large mount of repeated com-

275

M

y x putations in the computation of qij and qij in each iteration. For example, in

(t + 1)-th iteration, we randomly sample T images and feed them to the neural

ED

network of image modality. In addition, we use idx1 to denote the indices of the T training images and idx2 to denote the indices of the remaining training

PT

x images. The denominator dt+1 of qij is as follows:

X

AC

CE

dt+1 =

where

Ft+1 ∗m

=

X

1

t+1 T

e 2 F∗k

Ft+1 ∗m

+

k∈idx1 m∈idx1 ,m6=k

2

X

X

1

t+1 T

e 2 F∗k

Ft+1 ∗m

+

(27)

k∈idx1 m∈idx2

X

X

1

t+1 T

e 2 F∗k

Ft+1 ∗m

k∈idx2 m∈idx2 ,m6=k

Ft∗m ,

if m ∈ idx2 . Therefore, the third term in dt+1 is computed

repeatedly, since it has been computed in dt . In order to reduce the computation

18

ACCEPTED MANUSCRIPT

cost, we can iteratively compute dt+1 as follows: X

X

t

1

e 2 F∗k

T

Ft∗m

k∈idx1 m∈idx1 ,m6=k

X

2

X

1

t

e 2 F∗k

T

Ft∗m

+

k∈idx1 m∈idx2

X

X

1

t+1 T

e 2 F∗k

Ft+1 ∗m

k∈idx1 m∈idx1 ,m6=k

2

X

X

1

t+1 T

e 2 F∗k

k∈idx1 m∈idx2



CR IP T

dt+1 = dt −

Ft+1 ∗m

(28)

+

The computation cost can decrease from (n2 − n)/2 pairs to T 2 − T + 2nT pairs if the symmetric property of inner product is used.

AN US

280

5. Experiments

In this section, we conduct experiments of cross-modality search on two benchmarks to validate the effectiveness of the proposed method. Each of the

285

M

two benchmarks consists of two modalities: image and text. In the following, we firstly introduce the experiment settings including the datasets, the baseline

ED

methods and the evaluation criteria. Then, in order to make fair comparisons, we will give the presentation and discussion of experimental results. Finally, we further investigate the parameter sensitivity of the proposed method. All

290

PT

the following experiments are run on a server with Intel(R) Xeon(R) E5-2620 [email protected] CPU, 128GB RAM and a NVIDIA TITAN Xp GPU with 12GB

CE

memory.

5.1. Experiment Settings

AC

5.1.1. Datasets

295

The proposed method is evaluated on two benchmark multi-modal datasets,

i.e., MIRFlickr-25K [62] and IAPRTC-12 [63]. Some statistics of these two datasets are introduced as follows: • MIRFlickr-25K consists of 25,000 image-text pairs which are annotated with some of 24 concepts. Several textual tags are provided for each image. 19

ACCEPTED MANUSCRIPT

Following the settings in [35], we remove the textual tag if its frequency is less than 20 in the dataset. Subsequently, we get 20,015 image-text pairs

300

for experiment. The text modality is represented as a 1386-dimensional

CR IP T

bag-of-words feature. For the hand-crafted feature based method, we use a 4096-dimensional CNN feature vector to represent each image. The dataset is randomly split into a query set of 2,000 image-text pairs and

a database set of 18,015 image-text pairs. We randomly sample 10,000

305

image-text pairs from the database instances as training instances.

• IAPRTC-12 consists of 20,000 images and each image is connected to

AN US

several descriptive sentences. Each instance is annotated with 275 pre-

defined labels. The size of the vocabulary is 4670. Therefore we use a 4670-dimensional bag-of-words vector to represent the text for each in-

310

stance. For the hand-crafted feature based cross-modal hashing methods, we use a 4096-dimensional CNN feature vector to represent each image.

M

The entire dataset is exploited for our experiment. The dataset is randomly split into a query set of 2,000 image-text pairs and a database of 18,000 image-text pairs. We randomly sample 10,000 image-text pairs

315

ED

from the database instances as training instances. 5.1.2. Baseline Methods

PT

We compare the proposed method with the following state-of-the-art crossmodal hashing methods. • Semantic Correlation Maximization (SCM) [26] learns hash functions via

CE

320

maximizing the semantic correlation between the two modalities with re-

AC

spect to the semantic labels. In the following experiments, we select the

325

sequential learning since it usually performs better than the direct eigendecomposition. • Semantics-Preserving Hashing (SePH) [30] learns unified binary codes via the Kullback-Leibler divergence between a probability distribution constructed with the pairwise similarity matrix and a estimated one con20

ACCEPTED MANUSCRIPT

structed with to-be-learnt hash codes, and the hash functions can be learnt via RBF kernel logistic regression. • Rank-order Preserving Hashing (RoPH) [31] learns hash functions via in-

330

CR IP T

tegrating a rank-order preserving loss with a ridge regression loss.

• Linear Subspace Ranking Hashing (LSRH) [33] learns hash functions via maximizing the rank correlation between two ranking subspaces.

• Deep Cross-Modal Hashing (DCMH) [35] learns high non-linear mapping functions via integrating feature learning and hash-code learning into an

335

AN US

end-to-end learning framework.

• Pairwise Relationship Deep Hashing (PRDH) [36] learns hash functions via integrating different types of pairwise constraints and additional decorrelation constraints to preserve the similarities and enhance the discriminative

5.1.3. Evaluation Criteria

M

ability of the hash codes respectively.

340

The multi-modal instances are usually associated with multiple labels and

ED

their similarity level to the query can give the refined ranking quality. Therefore, in the following experiments, the three commonly-used ranking metrics including Normalized Discounted Cumulative Gain (NDCG) [31], Average Cu-

PT

345

mulative Gain (ACG) [31], and weighted mean Average Precision (mAPw ) [31] are used to evaluate the ranking quality. These three metrics are defined as

CE

follows:

(1) NDCG. Given a single query q and a ranking list of p retrieved instances,

the NDCG score is defined as follows:

AC

350

N DCG@p =

p 1 X 2ri − 1 Z i=1 log(i + 1)

(29)

where ri is the similarity level of the i-th retrieved instance in the ranking list, i.e., the number of common labels shared between the query and the ith retrieved instance. Z is a normalization factor to ensure that the NDCG 21

ACCEPTED MANUSCRIPT

score equals to one for the correct ranking. The NDCG scores of all queries are 355

averaged in our evaluation. (2) ACG. The definition of ACG score is given as follows: 1X ri . p i=1

CR IP T

p

ACG@p =

(30)

Similarly the ACG scores of all queries are also averaged to evaluate the ranking quality.

(3) The weighted MAP. The definition of the weighted MAP is given as follows:

Q

APω (q) =

1 X APω (q) Q q=1 1

with

AN US

M APω @p =

pr>0

p X

(31)

δ(rt > 0)ACG@t,

t=1

where Q is the number of queries and pr>0 is the number of relevant data points in the ranking list.

Besides the three ranking metrics mentioned above, Mean average precision

M

360

(MAP) [7] in the Hamming ranking protocol and retrieval precisions within

ED

Hamming radius 2 (PH2) [7] in the hash lookup protocol of all the baselines are also reported in the following experiments.

PT

5.1.4. Implementation Details

Following the settings in [35], the first seven layers of the CNN for image

365

modality is initialized with the VGG-F [58] model which is pre-trained on Ima-

CE

geNet dataset [64]. In addition, we randomly initialize all the other parameters of the deep architecture. For both datasets, the mini-batch size and the number

AC

of outer-loop in Algorithm 1 are fixed to 128 and 500 respectively. We decrease

370

the learning rate from 10−1.5 to 10−3 evenly in the log space for each epoch (outer-loop). In the following experiments, µ and γ are empirically set to 0 and

1 respectively for all datasets. Subsequently, we will investigate the impact of α, β and τ on algorithm performance. For the comparing methods, experimental parameters are set according to the suggestions provided in the original papers.

22

ACCEPTED MANUSCRIPT

Table 1: The comparison of NDCG@100, ACG@100 and MAPw @500 on MIRFlickr-25K dataset from 16 bits to 128 bits. MIRFlickr-25K Method

Text query

32 bits

64 bits

128 bits

NDCG@100

ACG@100

MAPw @500

NDCG@100

ACG@100

MAPw @500

NDCG@100

ACG@100

MAPw @500

NDCG@100

ACG@100

MAPw @500

LSRH [33]

0.2326

1.2923

1.3152

0.2292

1.2543

1.2876

0.2493

1.3855

1.3966

0.2522

1.3621

1.3656

1.5431

0.3024

1.5709

1.5587

1.7825

0.3725

1.8399

1.8107

1.8174

0.3743

1.8668

1.8367

1.6823

0.3412

1.7322

1.7187

1.7223

0.3644

1.7973

1.7791

2.0729

0.4346

2.1219

2.0848

1.5915

0.3209

1.7240

1.7095

1.6095

0.3225

1.6419

1.6172

1.7519

0.3683

1.7875

1.7599

1.7121

0.3604

1.7562

1.7385

1.7660

0.3623

1.8274

1.8313

1.8352

0.3807

1.8958

1.8838

2.0307

0.4467

2.1309

2.0897

SCM [26]

0.2863

1.5127

1.4958

0.2984

1.5513

1.5350

0.2997

1.5552

SePH [30]

0.3287

1.6865

1.6728

0.3532

1.7686

1.7431

0.3643

1.8104

RoPH [31]

0.3218

1.6800

1.6625

0.3495

1.7790

1.7583

0.3666

1.8450

DCMH [35]

0.3060

1.6103

1.5908

0.3235

1.6612

1.6468

0.3342

1.7006

PRDH [36]

0.3182

1.6341

1.6203

0.3305

1.6827

1.6696

0.3472

1.7400

GLSP

0.3926

1.9353

1.9029

0.4099

2.0279

2.0026

0.4336

2.1180

LSRH [33]

0.2626

1.4484

1.4538

0.2735

1.4717

1.4720

0.2965

1.5910

SCM [26]

0.3012

1.5797

1.5529

0.3131

1.6062

1.5833

0.3205

1.6330

SePH [30]

0.3336

1.6565

1.6422

0.3553

1.7302

1.7106

0.3621

1.7786

RoPH [31]

0.3221

1.6125

1.6090

0.3469

1.7177

1.7015

0.3598

1.7265

DCMH [35]

0.3395

1.7303

1.7270

0.3415

1.7284

1.7420

0.3488

1.7604

PRDH [36]

0.3447

1.7624

1.7586

0.3538

1.7824

1.7738

0.3665

1.8441

GLSP

0.3527

1.8474

1.8539

0.4105

2.0205

1.9909

0.4219

2.0555

CR IP T

Image query

16 bits

AN US

Type

Table 2: The comparison of NDCG@100, ACG@100 and MAPw @500 on IAPRTC-12 dataset from 16 bits to 128 bits.

IAPRTC-12

Method

Text query

MAPw @500

NDCG@100

LSRH [33]

0.1155

0.6251

0.6332

0.1358

SCM [26]

0.0911

0.5195

0.5176

0.1364

SePH [30]

0.2137

0.9571

0.9365

0.2458

RoPH [31]

0.2046

0.9326

0.9136

0.2412

DCMH [35]

0.2008

0.9424

0.9127

0.2347

PRDH [36]

0.2031

0.9496

0.9391

0.2180

GLSP

0.2363

1.0983

1.0534

0.2699

LSRH [33]

0.1215

0.6358

0.6419

SCM [26]

0.1202

0.6091

0.5995

SePH [30]

0.2147

0.9725

0.9505

RoPH [31]

0.2107

0.9385

0.9151

0.8786

0.8494

64 bits

128 bits

ACG@100

MAPw @500

NDCG@100

ACG@100

MAPw @500

NDCG@100

ACG@100

MAPw @500

0.7146

0.7112

0.1385

0.7276

0.7206

0.1426

0.7462

0.7455

0.6627

0.6495

0.1401

0.6867

0.6654

0.1460

0.7025

0.6806

1.0603

1.0269

0.2630

1.0976

1.0550

0.2794

1.1383

1.0924

1.0344

1.0042

0.2572

1.1052

1.0703

0.2680

1.1360

1.0967

1.0717

1.0354

0.2606

1.1583

1.1174

0.2137

1.0118

0.9987

1.0035

0.9849

0.2451

1.0872

1.0585

0.2556

1.1215

1.0899

1.1644

1.1180

0.3115

1.2953

1.2359

0.3304

1.3555

1.2906

0.1452

0.7471

0.7406

0.1489

0.7561

0.7487

0.1575

0.7986

0.7886

0.1607

0.7530

0.7246

0.1643

0.7595

0.7309

0.1752

0.8002

0.7660

0.2412

1.0486

1.0178

0.2750

1.1349

1.0847

0.2930

1.1793

1.1222

0.2369

1.0371

1.0068

0.2657

1.1085

1.0714

0.2803

1.1543

1.1119

DCMH [35]

0.1811

0.9668

0.9451

0.2547

1.1254

1.0892

0.2655

1.1638

PRDH [36]

0.2447

1.0916

1.0598

0.2658

0.2068

1.1604

1.1237

0.2864

1.2203

1.1798

0.3002

1.2652

1.2190

GLSP

0.2104

0.9345

0.9173

0.2435

1.0611

1.0251

0.2901

1.2311

1.1804

0.3134

1.2901

1.2376

1.1304

We carefully implement PRDH on MatConvNet using the same network model

PT

375

32 bits

ACG@100

M

Image query

16 bits NDCG@100

ED

Type

in this paper. The parameters are carefully tuned and set according to the

CE

original paper.

5.2. Results and Analyses

AC

5.2.1. Results on MIRFlickr-25K

380

The experimental results including NDCG@100, ACG@100 and MAPw @500

of the proposed method and other methods on MIRFlickr-25K dataset are presented in Table 1. We can see that the proposed method outperforms the baselines, including non-deep supervised cross-modal hashing methods with the CNN features (SCM [26], SePH [30], LSRH [33] and RoPH [31]) and the deep

23

ACCEPTED MANUSCRIPT

Table 3: The comparison of MAP on MIRFlickr-25K and IAPRTC-12 datasets from 16 bits to 128 bits. MIRFlickr-25K

IAPRTC-12

Method 16 bits

32 bits

64 bits

128 bits

16 bits

LSRH [33]

0.6611

0.6716

0.6822

0.6909

0.3977

SCM [26]

0.6798

0.6890

0.6962

0.6998

0.3640

SePH [30]

0.6977

0.6782

0.6851

0.6895

0.4782

RoPH [31]

0.7038

0.7139

0.7230

0.7264

0.4638

DCMH [35]

0.7399

0.7519

0.7555

0.7644

0.4773

Image query

0.7344

0.7495

0.7551

0.7659

0.4831

GLSP

0.7464

0.7706

0.7927

0.7989

0.5040

LSRH [33]

0.6809

0.6859

0.6997

0.7121

0.4201

SCM [26]

0.6737

0.6824

0.6820

0.6854

0.3452

0.6966

0.7033

0.7373

0.7392

0.7990

0.7166

0.6912

0.7180

0.7298

DCMH [35]

0.7832

0.7980

PRDH [36]

0.7834

0.7941

GLSP

0.7538

0.7766

128 bits

0.4274

0.4386

0.4059

0.4084

0.4197

0.4933

0.4636

0.4732

0.4899

0.5076

0.5166

0.5039

0.5276

0.5437

0.5101

0.5294

0.5443

0.5360

0.5764

0.6017

0.4357

0.4483

0.4591

0.3587

0.3580

0.3632

0.4821

0.4995

0.4693

0.4779

0.4668

0.4951

0.5139

0.5239

0.8032

0.5306

0.5535

0.5804

0.5913

0.8006

0.8058

0.5359

0.5675

0.5870

0.5981

0.8026

0.8046

0.4825

0.5198

0.5653

0.5936

M

SePH [30] RoPH [31]

Text query

64 bits

0.4210

AN US

PRDH [36]

32 bits

CR IP T

Type

Table 4: The comparison of PH2 on MIRFlickr-25K and IAPRTC-12 datasets from 16

ED

bits to 128 bits.

MIRFlickr-25K

Type

32 bits

64 bits

128 bits

16 bits

32 bits

64 bits

128 bits

LSRH [33]

0.7649

0.8394

0.8858

0.9256

0.4940

0.5715

0.6398

0.7155

SCM [26]

0.7230

0.7848

0.8112

0.8329

0.3706

0.4989

0.5318

0.7567

SePH [30]

0.7991

0.9377

0.9692

0.9792

0.6605

0.8053

0.9416

0.9895

RoPH [31]

0.7948

0.8504

0.8686

0.9567

0.6371

0.8134

0.9830

0.9983

DCMH [35]

0.8531

0.9114

0.9487

0.9302

0.8531

0.9114

0.9487

0.9302

PRDH [36]

0.8641

0.9072

0.9661

1.0000

0.8641

0.9072

0.9661

1.0000

PT

16 bits

AC

CE

Image query

Text query

IAPRTC-12

Method

GLSP

0.9577

0.9496

1.0000

0.0000

0.9230

1.0000

0.0000

0.0000

LSRH [33]

0.7523

0.8344

0.8972

0.9839

0.4863

0.5678

0.6029

0.5653

SCM [26]

0.7570

0.8004

0.8515

0.9142

0.3769

0.4643

0.4673

0.3460

SePH [30]

0.8130

0.9320

0.9511

0.9611

0.6680

0.8112

0.9422

0.9927

RoPH [31]

0.7876

0.8469

0.8818

0.9443

0.6486

0.8297

0.9756

0.9993

DCMH [35]

0.8373

0.8946

0.9521

1.0000

0.8373

0.8946

0.9521

1.0000

PRDH [36]

0.8665

0.9043

0.9727

1.0000

0.8665

0.9043

0.9727

1.0000

GLSP

0.9755

0.9678

0.0000

0.0000

0.9358

0.0000

0.0000

0.0000

24

ACCEPTED MANUSCRIPT

385

supervised cross-modal hashing methods (DCMH [35] and PRDH [36]). The improvement can be attributed to the following reasons: 1) the introduction of metric learning can generate a large margin between similar and dissimilar hash

CR IP T

codes; 2) the introduction of supervised information with multilevel similarity can guide to preserving the global multilevel semantic structure into hash codes; 390

3) the end-to-end learning framework can enhance the feedback between the

hashing learning procedure and the feature learning procedure. SePH and RoPH perform better than SCM, LSRH, and DCMH in most case. The reason can be that SePH and RoPH can learn unified binary codes for different modalities

AN US

which can enhance the correlations across different modalities.

The MAP performance of the proposed methods and other methods on

395

MIRFlickr-25K dataset are presented in Table 3. For image query, the proposed method can achieve higher MAP performance than other baselines. For text query, the proposed method can achieve competitive MAP performance with DCMH and PRDH when using more bits (e.g., 64-bit and 128-bit). The PH2 performance of the proposed methods and other baselines on MIRFlickr-

M

400

25K dataset are presented in Table 4. For image query and text query, the

ED

proposed method performs better than other baselines from 16 bits to 64 bits and 16 bits to 32 bits respectively. Then, the precision values of the proposed method decrease sharply with the increasing number of hashing bits. The reason is that the Hamming space becomes increasing sparse when using longer

PT

405

codes and few data points fall within the Hamming ball with radius 2.

CE

5.2.2. Results on IAPRTC-12 The experimental results including NDCG@100, ACG@100 and MAPw @500

AC

of the proposed method and other methods on IAPRTC-12 dataset are pre-

410

sented in Table 2. For image query, we can see that the proposed method outperforms the baselines under different ranking metrics. For text query, PRDH performs better than the proposed method and the other baselines from 16 bits to 32 bits. It can be attributed to the introduction of the intra-modal similarity preservation and the intra-modal decorrelation constraints. In addition, 25

ACCEPTED MANUSCRIPT

415

SePH performs better than the proposed method and the other baselines except PRDH at 16 bits. The reason can be attributed to the kernel method for the text feature extraction. Moreover, SePH performs better than the other baselines

CR IP T

for most cases. The reason can be that SePH incorporates the global multilevel similarity into the resulting hash codes. From another point of view, it implies 420

that the introduction of multilevel similarity can improve the performance of cross-modal hashing for multi-label retrieval.

The MAP performance of the proposed methods and other methods on

IAPRTC-12 dataset are presented in Table 3. For image query, the pro-

425

AN US

posed method can achieve higher MAP performance than other baselines. For

text query, the proposed method can achieve competitive MAP performance with DCMH and PRDH when using more bits (e.g., 128-bit). The PH2 performance of the proposed methods and other baselines on IAPRTC-12 dataset are presented in Table 4. For image query and text query, the proposed method performs better than other baselines from 16 bits to 32 bits and 16 bits respectively. Then, the precision values of the proposed method decrease sharply with

M

430

the increasing number of hashing bits.

ED

5.3. Effect Analysis of Each Component The proposed method mainly consists of three components: 1) metric learn-

435

PT

ing for capturing inter-modal correlations; 2) semantic preserving hash learning for capturing intra-modal correlations; 3) consistent regularization constraint. In this subsection, we will investigate the effect of each component. Therefore,

CE

we define the following three alternative baselines: 1. GLSP-1: training the network model without margin, i.e., τ = 0.

AC

2. GLSP-2: training the network model without the global semantic struc-

440

ture preserving, i.e., J xy + γR. 3. GLSP-3: training the network model without consistent regularization constraint, i.e., J xy + αJ xx + βJ yy . The experimental results of these baselines on MIRFlickr-25K and IAPRTC-

12 datasets are shown in Table 5 and Table 6 respectively. We can see that 26

Table 5: The effect of each component on the MIRFlickr-25K dataset from 16 bits to 128 bits. Each cell reports NDCG@100 / ACG@100 / MAPw@500.

Image query
  GLSP-1:  16 bits 0.3674 / 1.8908 / 1.8555;  32 bits 0.3744 / 1.9370 / 1.8902;  64 bits 0.3874 / 1.9402 / 1.9044;  128 bits 0.4032 / 2.0174 / 1.9778
  GLSP-2:  16 bits 0.3563 / 1.8045 / 1.7697;  32 bits 0.3999 / 1.9969 / 1.9724;  64 bits 0.4232 / 2.0987 / 2.0676;  128 bits 0.4294 / 2.1012 / 2.0573
  GLSP-3:  16 bits 0.3558 / 1.7526 / 1.7078;  32 bits 0.3487 / 1.7660 / 1.7565;  64 bits 0.3907 / 1.9677 / 1.9329;  128 bits 0.4327 / 2.1203 / 2.0818
  GLSP:    16 bits 0.3926 / 1.9353 / 1.9029;  32 bits 0.4099 / 2.0279 / 2.0026;  64 bits 0.4336 / 2.1180 / 2.0729;  128 bits 0.4346 / 2.1219 / 2.0848

Text query
  GLSP-1:  16 bits 0.3299 / 1.5374 / 1.5270;  32 bits 0.3107 / 1.5239 / 1.5219;  64 bits 0.3950 / 1.9612 / 1.9243;  128 bits 0.3960 / 1.9204 / 1.9051
  GLSP-2:  16 bits 0.3422 / 1.7750 / 1.7749;  32 bits 0.4011 / 1.9662 / 1.9674;  64 bits 0.4041 / 2.0384 / 2.0287;  128 bits 0.4245 / 2.0606 / 2.0444
  GLSP-3:  16 bits 0.3270 / 1.6390 / 1.6202;  32 bits 0.3808 / 1.9044 / 1.8789;  64 bits 0.4191 / 2.0354 / 1.9899;  128 bits 0.4444 / 2.1185 / 2.0837
  GLSP:    16 bits 0.3527 / 1.8474 / 1.8539;  32 bits 0.4105 / 2.0205 / 1.9909;  64 bits 0.4219 / 2.0555 / 2.0307;  128 bits 0.4467 / 2.1309 / 2.0897

Table 6: The effect of each component on the IAPRTC-12 dataset from 16 bits to 128 bits. Each cell reports NDCG@100 / ACG@100 / MAPw@500.

Image query
  GLSP-1:  16 bits 0.2110 / 0.9766 / 0.9496;  32 bits 0.2378 / 1.0610 / 1.0256;  64 bits 0.2598 / 1.1504 / 1.1049;  128 bits 0.2735 / 1.1834 / 1.1455
  GLSP-2:  16 bits 0.2218 / 1.0126 / 0.9721;  32 bits 0.2580 / 1.1187 / 1.0760;  64 bits 0.3046 / 1.2703 / 1.2131;  128 bits 0.3223 / 1.3277 / 1.2655
  GLSP-3:  16 bits 0.2023 / 0.9075 / 0.8827;  32 bits 0.2554 / 1.0966 / 1.0523;  64 bits 0.2810 / 1.1701 / 1.1160;  128 bits 0.3214 / 1.3093 / 1.2460
  GLSP:    16 bits 0.2363 / 1.0983 / 1.0534;  32 bits 0.2699 / 1.1644 / 1.1180;  64 bits 0.3115 / 1.2953 / 1.2359;  128 bits 0.3304 / 1.3555 / 1.2906

Text query
  GLSP-1:  16 bits 0.1910 / 0.8854 / 0.8703;  32 bits 0.2235 / 1.0392 / 1.0049;  64 bits 0.2533 / 1.1182 / 1.0810;  128 bits 0.2719 / 1.1742 / 1.1304
  GLSP-2:  16 bits 0.2049 / 0.9252 / 0.9058;  32 bits 0.2311 / 1.0247 / 0.9919;  64 bits 0.2809 / 1.1770 / 1.1381;  128 bits 0.3081 / 1.2793 / 1.2284
  GLSP-3:  16 bits 0.1901 / 0.8633 / 0.8371;  32 bits 0.2164 / 0.9670 / 0.9373;  64 bits 0.2390 / 1.0264 / 0.9890;  128 bits 0.2967 / 1.2199 / 1.1659
  GLSP:    16 bits 0.2104 / 0.9345 / 0.9173;  32 bits 0.2435 / 1.0611 / 1.0251;  64 bits 0.2901 / 1.2311 / 1.1804;  128 bits 0.3134 / 1.2901 / 1.2376

We can see that the local semantic structure preservation (metric learning; cf. GLSP-1) contributes the most to the retrieval performance, followed by the consistent regularization constraint (cf. GLSP-3) and the global semantic structure preservation (cf. GLSP-2). The reason is that preserving the local semantic structure provides a more accurate similar/dissimilar discrimination for two data points from different modalities, which helps to ensure that similar data points rank before dissimilar data points as much as possible. Finally, the best retrieval performance is achieved when all three terms are combined.

5.4. Parameter Sensitivity Analysis

There are five parameters in the proposed method: α, β, γ, µ and τ. The parameters γ and µ are fixed in all experiments, so we only analyze the effect of the remaining parameters (α, β and τ) in this subsection. More specifically, we conduct a series of experiments in which we vary one parameter while fixing the others, and report the results in Fig. 2 and Fig. 3. From the results, we have the following observations: (1) in most cases, the proposed method does not obtain the best results when α is set to large values; (2) in most cases, the best results are not obtained when β is set to small values; (3) the results are usually unsatisfactory when the margin parameter τ is set to too large or too small a value. α and β control the contribution of each modality to preserving the global multilevel similarity. A large α may break the balance among the components, and a small β may not adequately preserve the multilevel similarity in the hash codes of the text modality. τ controls the margin between similar and dissimilar hash codes across modalities. A small τ cannot well preserve the discriminative ability of the hash codes, while a large τ introduces a large number of hard pairwise examples, which makes the model hard to train.
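The one-parameter-at-a-time protocol described above can be organized as in the sketch below. The candidate grids appear to span the ranges plotted in Fig. 2 and Fig. 3 (α, β from 10^-4 to 10^4, τ from 1 to 10), the default settings are assumed, and train_and_evaluate is a hypothetical placeholder for a full training and evaluation run.

```python
defaults = {"alpha": 1.0, "beta": 1.0, "tau": 3.0}   # assumed default settings

grids = {
    "alpha": [1e-4, 1e-2, 1e0, 1e2, 1e4],
    "beta":  [1e-4, 1e-2, 1e0, 1e2, 1e4],
    "tau":   [1, 3, 5, 7, 10],
}

def train_and_evaluate(alpha, beta, tau):
    # Hypothetical placeholder: train the model with these hyper-parameters and
    # return retrieval metrics such as NDCG@100 / ACG@100 / MAPw@500.
    return {"NDCG@100": 0.0}

results = {}
for name, values in grids.items():
    for value in values:
        params = dict(defaults, **{name: value})   # vary one parameter, fix the rest
        results[(name, value)] = train_and_evaluate(**params)
```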

Figure 2: The experimental performances for the variations in the value of hyper-parameters on the MIRFlickr-25K dataset. The hash code length is fixed at 64 bits. (The panels plot NDCG@100, ACG@100 and wMAP@500 for image-to-text and text-to-image retrieval as α, β and τ are varied.)

Figure 3: The experimental performances for the variations in the value of hyper-parameters on the IAPRTC-12 dataset. The hash code length is fixed at 64 bits. (Same panel layout as Fig. 2.)

5.5. Convergence Study

In this subsection, we present the value of the objective function over the training iterations on the two datasets. The convergence curves on the MIRFlickr-25K and IAPRTC-12 datasets at 64 bits are shown in Fig. 4 (a) and Fig. 4 (b), respectively. From these figures, we can see that the value of the objective function decreases steadily as the number of iterations increases, which validates the effectiveness of Algorithm 1.

Figure 4: Convergence curves on the MIRFlickr-25K and IAPRTC-12 datasets at 64 bits. (a) MIRFlickr-25K; (b) IAPR TC-12.

Table 7: The comparison of training time and test time (in seconds) on the MIRFlickr-25K dataset from 16 bits to 128 bits. Each cell reports training time / test time (image) / test time (text).

  LSRH [33]:  16 bits 3.45e1 / 7.60e-3 / 3.80e-3;  32 bits 5.73e1 / 1.06e-2 / 5.70e-3;  64 bits 1.13e2 / 1.89e-2 / 1.08e-2;  128 bits 2.24e2 / 3.50e-2 / 1.97e-2
  SCM [26]:   16 bits 1.53e4 / 7.90e-3 / 2.70e-3;  32 bits 2.36e4 / 5.70e-3 / 2.40e-3;  64 bits 4.56e4 / 1.03e-2 / 5.30e-3;  128 bits 9.15e4 / 1.59e-2 / 6.80e-3
  SePH [30]:  16 bits 1.94e2 / 7.86e-2 / 3.70e-2;  32 bits 2.00e2 / 6.42e-2 / 3.10e-2;  64 bits 3.22e2 / 6.21e-2 / 3.09e-2;  128 bits 4.89e2 / 6.30e-2 / 3.00e-2
  RoPH [31]:  16 bits 3.86e1 / 6.63e-2 / 2.49e-2;  32 bits 4.02e1 / 5.29e-2 / 1.65e-2;  64 bits 6.00e1 / 5.72e-2 / 2.51e-2;  128 bits 9.67e1 / 5.89e-2 / 2.61e-2
  DCMH [35]:  16 bits 5.16e3 / 1.56e0 / 5.34e-2;  32 bits 5.33e3 / 1.64e0 / 4.52e-2;  64 bits 5.53e3 / 1.68e0 / 3.99e-2;  128 bits 6.53e3 / 1.51e0 / 5.26e-2
  PRDH [36]:  16 bits 6.00e3 / 1.54e0 / 4.95e-2;  32 bits 6.34e3 / 1.56e0 / 5.81e-2;  64 bits 6.77e3 / 1.69e0 / 4.74e-2;  128 bits 8.33e3 / 1.40e0 / 4.71e-2
  GLSP:       16 bits 9.29e3 / 1.53e0 / 5.36e-2;  32 bits 9.88e3 / 1.57e0 / 4.32e-2;  64 bits 1.06e4 / 1.44e0 / 5.41e-2;  128 bits 1.24e4 / 1.63e0 / 6.53e-2

Table 8: The comparison of training time and test time (in seconds) on the IAPRTC-12 dataset from 16 bits to 128 bits. Each cell reports training time / test time (image) / test time (text).

  LSRH [33]:  16 bits 3.63e1 / 7.40e-3 / 7.90e-3;  32 bits 7.20e1 / 1.06e-2 / 1.11e-2;  64 bits 1.47e2 / 1.82e-2 / 2.00e-2;  128 bits 2.87e2 / 3.73e-2 / 3.67e-2
  SCM [26]:   16 bits 1.10e4 / 3.70e-3 / 3.80e-3;  32 bits 2.36e4 / 5.60e-3 / 5.70e-3;  64 bits 4.62e4 / 8.00e-3 / 9.20e-3;  128 bits 8.99e4 / 1.52e-2 / 1.71e-2
  SePH [30]:  16 bits 1.60e2 / 5.79e-2 / 6.33e-2;  32 bits 2.19e2 / 5.65e-2 / 5.79e-2;  64 bits 2.77e2 / 5.99e-2 / 6.46e-2;  128 bits 4.84e2 / 5.86e-2 / 6.31e-2
  RoPH [31]:  16 bits 3.14e1 / 5.54e-2 / 6.28e-2;  32 bits 4.17e1 / 5.68e-2 / 6.17e-2;  64 bits 6.30e1 / 6.05e-2 / 6.50e-2;  128 bits 9.41e1 / 5.67e-2 / 6.12e-2
  DCMH [35]:  16 bits 5.42e3 / 1.61e0 / 1.59e-1;  32 bits 5.51e3 / 1.43e0 / 1.26e-1;  64 bits 5.51e3 / 1.70e0 / 1.43e-1;  128 bits 6.87e3 / 1.54e0 / 2.03e-1
  PRDH [36]:  16 bits 6.38e3 / 1.55e0 / 1.43e-1;  32 bits 6.29e3 / 1.60e0 / 1.25e-1;  64 bits 6.47e3 / 1.40e0 / 1.26e-1;  128 bits 8.46e3 / 1.67e0 / 1.74e-1
  GLSP:       16 bits 1.07e4 / 1.62e0 / 1.53e-1;  32 bits 1.12e4 / 1.65e0 / 1.63e-1;  64 bits 1.18e4 / 1.57e0 / 1.64e-1;  128 bits 1.31e4 / 1.37e0 / 1.30e-1

The training and test time for all baselines are listed in Tables 7 and 8. For the training time, the traditional cross-modal hashing methods including LSRH,

M

480

SePH and RoPH are more efficient than deep cross-modal hashing methods including DCMH, PRDH, and GLSP. The proposed method GLSP is slightly

ED

slower than DCMH and PRDH. For the test time, the time cost of CNN feature extraction is not considered for the traditional cross-modal hashing methods 485

including LSRH, SCM, SePH, and RoPH. If the time of extracting feature is

PT

considered in these non-deep cross-modal hashing methods, then the deep crossmodal hashing methods can have competitive efficiency with them. Because

CE

DCMH, PRDH, and GLSP share the same network architecture, their test times are closed. Therefore, in general, the proposed method GLSP is effective and relatively efficient for the large-scale retrieval tasks.

AC

490

6. Conclusion To improve the discriminative ability while preserving the multilevel seman-

tic structure of hash codes, we propose a global and local semantics-preserving

30

ACCEPTED MANUSCRIPT

Table 8: The comparison of training time and test time (in seconds) on IAPRTC-12 dataset from 16 bits to 128 bits. IAPRTC-12 16 bits

32 bits

64 bits

128 bits

Training

Test time (s)

Training

Test time (s)

Training

CR IP T

Method Test time (s)

time (s)

Image

Text

time (s)

Image

Text

time (s)

Image

LSRH [33]

3.63e1

7.40e-3

7.90e-3

7.20e1

1.06e-2

1.11e-2

1.47e2

SCM [26]

1.10e4

3.70e-3

3.80e-3

2.36e4

5.60e-3

5.70e-3

Training

Test time (s)

Text

time (s)

Image

Text

1.82e-2

2.00e-2

2.87e2

3.73e-2

3.67e-2

4.62e4

8.00e-3

9.20e-3

8.99e4

1.52e-2

1.71e-2

1.60e2

5.79e-2

6.33e-2

2.19e2

5.65e-2

5.79e-2

2.77e2

5.99e-2

6.46e-2

4.84e2

5.86e-2

6.31e-2

3.14e1

5.54e-2

6.28e-2

4.17e1

5.68e-2

6.17e-2

6.30e1

6.05e-2

6.50e-2

9.41e1

5.67e-2

6.12e-2

DCMH [35]

5.42e3

1.61e0

1.59e-1

5.51e3

1.43e0

1.26e-1

5.51e3

1.70e0

1.43e-1

6.87e3

1.54e0

2.03e-1

PRDH [36]

6.38e3

1.55e0

1.43e-1

6.29e3

1.60e0

1.25e-1

6.47e3

1.40e0

1.26e-1

8.46e3

1.67e0

1.74e-1

GLSP

1.07e4

1.62e0

1.53e-1

1.12e4

1.65e0

1.63e-1

1.18e4

1.57e0

1.64e-1

1.31e4

1.37e0

1.30e-1

AN US

SePH [30] RoPH [31]

based deep hashing method for cross-modal retrieval. More specifically, we in495

troduce a metric learning method for improving the discriminative ability of hash codes from different modalities. Meanwhile the local semantic structure is

M

preserved into the hash codes. In addition, a multilevel semantic affinity matrix is constructed to learn the global semantic structure preserving hash codes for

500

ED

each modality. Subsequently, in order to further enhance the performance of cross-modal retrieval, a consistent regularization term is introduced to guarantee that the resulting hash codes are as consistent as possible across different

PT

modalities. Finally, they are integrated into an end-to-end learning framework. The entire network model is trained via an efficient optimization algorithm.

CE

The experimental results carried out on two cross-modal benchmark datasets 505

demonstrate the effectiveness of the proposed method. In order to learn consistent hash codes across different modalities, two simple

AC

consistent square constraints are introduced. Some other complex technologies such as collective quantization, collective matrix factorization and so on can be used to further improve the correlations across different modalities. In addition,

510

different network models can be also exploited in the future work.

31

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant 61525102, Grant 61502084, and Grant 61601102.

References

[1] L. Ma, H. Li, F. Meng, Q. Wu, K. N. Ngan, Learning efficient binary codes from high-level feature representations for multi-label image retrieval, IEEE Transactions on Multimedia PP (99) (2017) 1–1.
[2] A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: International Conference on Very Large Data Bases, 1999, pp. 518–529.
[3] Y. Gong, S. Lazebnik, A. Gordo, F. Perronnin, Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2916–2929.
[4] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Advances in Neural Information Processing Systems, 2008, pp. 1753–1760.
[5] W. Liu, J. Wang, S. Kumar, S. Chang, Hashing with graphs, in: International Conference on Machine Learning, 2011, pp. 1–8.
[6] F. Shen, C. Shen, W. Liu, H. T. Shen, Supervised discrete hashing, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 37–45.
[7] J. Wang, O. Kumar, S. Chang, Semi-supervised hashing for scalable image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3424–3431.
[8] T. Song, J. Cai, T. Zhang, C. Gao, F. Meng, Q. Wu, Semi-supervised manifold-embedded hashing with joint feature representation and classifier learning, Pattern Recognition 68 (2017) 99–110.
[9] W. Liu, J. Wang, R. Ji, Y. Jiang, S. Chang, Supervised hashing with kernels, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2074–2081.
[10] W. Li, S. Wang, W. Kang, Feature learning based deep supervised hashing with pairwise labels, in: International Joint Conference on Artificial Intelligence, 2016, pp. 1711–1717.
[11] S. Chen, F. Shen, Y. Yang, X. Xu, J. Song, Supervised hashing with adaptive discrete optimization for multimedia retrieval, Neurocomputing 253 (2017) 97–103.
[12] Y. Xu, F. Shen, X. Xu, L. Gao, Y. Wang, X. Tan, Large-scale image retrieval with supervised sparse hashing, Neurocomputing 229 (2017) 45–53.
[13] C. Deng, X. Liu, Y. Mu, J. Li, Large-scale multi-task image labeling with adaptive relevance discovery and feature hashing, Signal Processing 112 (2015) 137–145.
[14] C. Deng, H. Deng, X. Liu, Y. Yuan, Adaptive multi-bit quantization for hashing, Neurocomputing 151 (2015) 319–326.
[15] X. Liu, C. Deng, Y. Mu, Z. Li, Boosting complementary hash tables for fast nearest neighbor search, in: Proceedings of AAAI Conference on Artificial Intelligence, 2017.
[16] D. Zhang, F. Wang, L. Si, Composite hashing with multiple information sources, in: ACM SIGIR Conference on Research and Development in Information Retrieval, 2011, pp. 225–234.
[17] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, Y. Zhuang, Sparse multi-modal hashing, IEEE Transactions on Multimedia 16 (2) (2014) 427–439.
[18] J. Zhou, G. Ding, Y. Guo, Latent semantic sparse hashing for cross-modal similarity search, in: ACM SIGIR Conference on Research and Development in Information Retrieval, 2014, pp. 415–424.
[19] G. Ding, Y. Guo, J. Zhou, Collective matrix factorization hashing for multimodal data, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2083–2090.
[20] G. Irie, H. Arai, Y. Taniguchi, Alternating co-quantization for cross-modal hashing, in: IEEE International Conference on Computer Vision, 2015, pp. 1886–1894.
[21] D. Wang, X. Gao, X. Wang, L. He, Semantic topic multimodal hashing for cross-media retrieval, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2015, pp. 3890–3896.
[22] L. Xie, L. Zhu, G. Chen, Unsupervised multi-graph cross-modal hashing for large-scale multimedia retrieval, Multimedia Tools Appl. 75 (15) (2016) 9185–9204.
[23] M. Long, Y. Cao, J. Wang, P. S. Yu, Composite correlation quantization for efficient multimodal retrieval, in: Proceedings of International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 579–588.
[24] L. Zhu, Z. Huang, X. Liu, X. He, J. Sun, X. Zhou, Discrete multimodal hashing with canonical views for robust mobile landmark search, IEEE Transactions on Multimedia 19 (9) (2017) 2066–2079.
[25] X. Liu, J. He, C. Deng, B. Lang, Collaborative hashing, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2147–2154.
[26] D. Zhang, W. Li, Large-scale supervised multimodal hashing with semantic correlation maximization, in: Proceedings of AAAI Conference on Artificial Intelligence, 2014, pp. 2177–2183.
[27] B. Wu, Q. Yang, W. Zheng, Y. Wang, J. Wang, Quantized correlation hashing for fast cross-modal search, in: Proceedings of International Joint Conference on Artificial Intelligence, 2015, pp. 3946–3952.
[28] H. Liu, R. Ji, Y. Wu, G. Hua, Supervised matrix factorization for cross-modality hashing, in: Proceedings of International Joint Conference on Artificial Intelligence, 2016, pp. 1767–1773.
[29] D. Wang, X. Gao, X. Wang, L. He, B. Yuan, Multimodal discriminative binary embedding for large-scale cross-modal retrieval, IEEE Transactions on Image Processing 25 (10) (2016) 4540–4554.
[30] Z. Lin, G. Ding, J. Han, J. Wang, Cross-view retrieval via probability-based semantics-preserving hashing, IEEE Transactions on Cybernetics PP (99) (2017) 1–14.
[31] K. Ding, B. Fan, C. Huo, S. Xiang, C. Pan, Cross-modal hashing via rank-order preserving, IEEE Transactions on Multimedia 19 (3) (2017) 571–585.
[32] X. Xu, F. Shen, Y. Yang, H. T. Shen, X. Li, Learning discriminative binary codes for large-scale cross-modal retrieval, IEEE Transactions on Image Processing 26 (5) (2017) 2494–2507.
[33] K. Li, G. Qi, J. Ye, K. A. Hua, Linear subspace ranking hashing for cross-modal retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 39 (9) (2017) 1825–1838.
[34] L. Liu, Z. Lin, L. Shao, F. Shen, G. Ding, J. Han, Sequential discrete hashing for scalable cross-modality similarity retrieval, IEEE Trans. Image Processing 26 (1) (2017) 107–118.
[35] Q. Jiang, W. Li, Deep cross-modal hashing, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3270–3278.
[36] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, X. Gao, Pairwise relationship guided deep hashing for cross-modal retrieval, in: Proceedings of AAAI Conference on Artificial Intelligence, 2017, pp. 1618–1625.
[37] Y. Cao, M. Long, J. Wang, Q. Yang, P. S. Yu, Deep visual-semantic hashing for cross-modal retrieval, in: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1445–1454.
[38] Y. Cao, M. Long, J. Wang, S. Liu, Collective deep quantization for efficient cross-modal retrieval, in: Proceedings of AAAI Conference on Artificial Intelligence, 2017, pp. 3974–3980.
[39] Y. Cao, M. Long, J. Wang, Correlation hashing network for efficient cross-modal retrieval, CoRR abs/1602.06697.
[40] L. Liu, F. Shen, Y. Shen, X. Liu, L. Shao, Deep sketch hashing: Fast free-hand sketch-based image retrieval, CoRR abs/1703.05605.
[41] C. Deng, Z. Chen, X. Liu, X. Gao, D. Tao, Triplet-based deep hashing network for cross-modal retrieval, IEEE Transactions on Image Processing 27 (8) (2018) 3893–3903.
[42] F. Zhao, Y. Huang, L. Wang, T. Tan, Deep semantic ranking based hashing for multi-label image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1556–1564.
[43] K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey on cross-modal retrieval, CoRR abs/1607.06215.
[44] N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, N. Vasconcelos, A new approach to cross-modal multimedia retrieval, in: Proceedings of the International Conference on Multimedia, 2010, pp. 251–260.
[45] C. Deng, X. Tang, J. Yan, W. Liu, X. Gao, Discriminative dictionary learning with common label alignment for cross-modal retrieval, IEEE Trans. Multimedia 18 (2) (2016) 208–218.
[46] V. E. Liong, J. Lu, Y. Tan, J. Zhou, Deep coupled metric learning for cross-modal matching, IEEE Trans. Multimedia 19 (6) (2017) 1234–1244.
[47] H. Jégou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Trans. Pattern Anal. Mach. Intell. 33 (1) (2011) 117–128.
[48] A. Babenko, V. S. Lempitsky, Additive quantization for extreme vector compression, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 931–938.
[49] T. Zhang, C. Du, J. Wang, Composite quantization for approximate nearest neighbor search, in: Proceedings of the International Conference on Machine Learning, 2014, pp. 838–846.
[50] K. He, F. Wen, J. Sun, K-means hashing: An affinity-preserving quantization method for learning binary compact codes, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2938–2945.
[51] X. Liu, Z. Li, C. Deng, D. Tao, Distributed adaptive binary quantization for fast nearest neighbor search, IEEE Trans. Image Processing 26 (11) (2017) 5324–5336.
[52] X. Liu, B. Du, C. Deng, M. Liu, B. Lang, Structure sensitive hashing with adaptive product quantization, IEEE Trans. Cybernetics 46 (10) (2016) 2252–2264.
[53] T. Zhang, J. Wang, Collaborative quantization for cross-modal similarity search, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2036–2045.
[54] E. Yang, C. Deng, C. Li, W. Liu, J. Li, D. Tao, Shared predictive cross-modal deep quantization, IEEE Transactions on Neural Networks and Learning Systems PP (99) (2018) 1–12.
[55] B. Dai, R. Guo, S. Kumar, N. He, L. Song, Stochastic generative hashing, in: Proceedings of the International Conference on Machine Learning, 2017, pp. 913–922.
[56] X. Liu, L. Huang, C. Deng, B. Lang, D. Tao, Query-adaptive hash code ranking for large-scale multi-view visual search, IEEE Trans. Image Processing 25 (10) (2016) 4514–4524.
[57] X. Liu, L. Huang, C. Deng, J. Lu, B. Lang, Multi-view complementary hash tables for nearest neighbor search, in: IEEE International Conference on Computer Vision, 2015, pp. 1107–1115.
[58] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, in: British Machine Vision Conference, 2014.
[59] J. Lu, J. Hu, Y. P. Tan, Discriminative deep metric learning for face and kinship verification, IEEE Transactions on Image Processing 26 (9) (2017) 4269–4282.
[60] K. Sohn, Improved deep metric learning with multi-class n-pair loss objective, in: Advances in Neural Information Processing Systems, 2016, pp. 1849–1857.
[61] J. Lu, G. Wang, W. Deng, P. Moulin, J. Zhou, Multi-manifold deep metric learning for image set classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1137–1145.
[62] M. J. Huiskes, M. S. Lew, The MIR Flickr retrieval evaluation, in: Proceedings of ACM SIGMM International Conference on Multimedia Information Retrieval, 2008, pp. 39–43.
[63] H. J. Escalante, C. A. Hernández, J. A. González, A. López-López, M. Montes-y-Gómez, E. F. Morales, L. E. Sucar, L. V. Pineda, M. Grubinger, The segmented and annotated IAPR TC-12 benchmark, Computer Vision and Image Understanding 114 (4) (2010) 419–428.
[64] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, F. Li, Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.

Lei Ma received the B.Sc. degree in communication engineering from Hubei University in 2012, and he is currently working toward the Ph.D. degree in signal and information processing in the Intelligent Visual Information Processing and Communication Laboratory (IVIPC) at the University of Electronic Science and Technology of China (UESTC). His research interests include large-scale multimedia indexing and retrieval, computer vision, pattern recognition and machine learning.

Hongliang Li (SM’12) received his Ph.D. degree in Electronics and Information Engineering from Xian Jiaotong University, China, in 2005. From 2005 to 2006, he joined the Visual Signal Processing and Communication Laboratory (VSPC) of the Chinese University of Hong Kong (CUHK) as a Research Associate. From 2006 to 2008, he was a Postdoctoral Fellow at the same laboratory in CUHK. He is currently a Professor in the School of Electronic Engineering, University of Electronic Science and Technology of China. His research interests include image segmentation, object detection, image and video coding, visual attention, and multimedia communication systems.

Dr. Li has authored or co-authored numerous technical articles in well-known international journals and conferences. He is a co-editor of a Springer book titled Video Segmentation and Its Applications. Dr. Li has been involved in many professional activities. He is a member of the Editorial Board of the Journal on Visual Communications and Image Representation, and an Area Editor of Signal Processing: Image Communication, Elsevier Science. He served as Technical Program Co-chair for VCIP 2016 and ISPACS 2009, General Co-chair of ISPACS 2010, Publicity Co-chair of IEEE VCIP 2013, Local Chair of IEEE ICME 2014, and as a TPC member for a number of international conferences, e.g., ICME 2013, ICME 2012, ISCAS 2013, PCM 2007, PCM 2009, and VCIP 2010. He is now a senior member of IEEE.

Fanman Meng (S’12-M’14) received the Ph.D. degree in signal and information processing from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2014. From July 2013 to July 2014, he joined the Division of Visual and Interactive Computing of Nanyang Technological University, Singapore, as a Research Assistant. He is currently an Associate Professor in the School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China. His research interests include image segmentation and object detection.

Dr. Meng has authored or co-authored numerous technical articles in well-known international journals and conferences. He received the "Best Student Paper Honorable Mention Award" at the 12th Asian Conference on Computer Vision (ACCV 2014) in Singapore and the "Top 10% Paper Award" at the IEEE International Conference on Image Processing (ICIP 2014) in Paris, France. He is now a member of IEEE and the IEEE CAS society.

Qingbo Wu (S’12-M’13) received the B.E. degree in Education of Applied Electronic Technology from Hebei Normal University in 2009, and the Ph.D. degree in signal and information processing from the University of Electronic Science and Technology of China in 2015. From February 2014 to May 2014, he was a Research Assistant with the Image and Video Processing (IVP) Laboratory at the Chinese University of Hong Kong. From October 2014 to October 2015, he was a visiting scholar with the Image & Vision Computing (IVC) Laboratory at the University of Waterloo. He is currently a Lecturer in the School of Electronic Engineering, University of Electronic Science and Technology of China. His research interests include image/video coding, quality evaluation, and perceptual modeling and processing.

King N. Ngan (F’00) received the Ph.D. degree in Electrical Engineering from Loughborough University, U.K. He is currently a Chair Professor at the Department of Electronic Engineering, Chinese University of Hong Kong. He was previously a full professor at Nanyang Technological University, Singapore, and the University of Western Australia, Australia. He has been appointed Chair Professor at the University of Electronic Science and Technology of China, Chengdu, China, under the National Thousand Talents Program since 2012. He holds honorary and visiting professorships at numerous universities in China, Australia and South East Asia.

Prof. Ngan served as Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology, Journal on Visual Communications and Image Representation, EURASIP Journal of Signal Processing: Image Communication, and Journal of Applied Signal Processing. He chaired and co-chaired a number of prestigious international conferences on image and video processing, including the 2010 IEEE International Conference on Image Processing, and served on the advisory and technical committees of numerous professional organizations. He has published extensively, including 3 authored books, 7 edited volumes, over 400 refereed technical papers, and 9 edited special issues in journals. In addition, he holds 15 patents in the areas of image/video coding and communications.

Prof. Ngan is a Fellow of IEEE (U.S.A.), IET (U.K.), and IEAust (Australia), and an IEEE Distinguished Lecturer in 2006-2007.