Communicated by Dr. Deng Cheng
Accepted Manuscript
Global and Local Semantics-Preserving Based Deep Hashing for Cross-Modal Retrieval Lei Ma, Hongliang Li, Fanman Meng, Qingbo Wu, King Ngi Ngan PII: DOI: Reference:
S0925-2312(18)30635-0 10.1016/j.neucom.2018.05.052 NEUCOM 19613
To appear in:
Neurocomputing
Received date: Revised date: Accepted date:
17 October 2017 3 May 2018 17 May 2018
Please cite this article as: Lei Ma, Hongliang Li, Fanman Meng, Qingbo Wu, King Ngi Ngan, Global and Local Semantics-Preserving Based Deep Hashing for Cross-Modal Retrieval, Neurocomputing (2018), doi: 10.1016/j.neucom.2018.05.052
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Global and Local Semantics-Preserving Based Deep Hashing for Cross-Modal Retrieval
a School
CR IP T
Lei Maa,∗∗, Hongliang Lia,∗∗, Fanman Menga , Qingbo Wua , King Ngi Ngana of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, PR China
Abstract
AN US
Cross-modal hashing methods map similar data entities from heterogeneous data
sources to binary codes with smaller Hamming distance. However, most existing cross-modal hashing methods learn the hash codes with the hand-crafted features which will not generate optimal hash codes and achieve satisfactory performance. Deep cross-modal hashing methods integrate feature learning and hash coding into an end-to-end learning framework which have achieved promising
M
results. However, these deep cross-modal hashing methods do not well preserve the discriminative ability and the global multilevel similarity in hash learning
ED
procedure. In this paper, we propose a global and local semantics-preserving based deep hashing method for cross-modal retrieval. More specifically, a large margin is enforced between similar hash codes and dissimilar hash codes from
PT
an inter-modal view to learn discriminative hash codes. Therefore the learned hash codes can well preserve local semantic structure. Sequently, the supervised information with the global multilevel similarity is introduced to learn
CE
semantics-preserving hash codes for each intra-modal view. As a consequence, the global semantic structure can be preserved into the hash codes. Further-
AC
more, a consistent regularization constraint is added to generate unified hash codes. Ultimately, the feature learning procedure and the hash coding procedure ∗ Corresponding
author author Email addresses: leima
[email protected] (Lei Ma),
[email protected] (Hongliang Li) URL:
[email protected] (Fanman Meng),
[email protected] (Qingbo Wu),
[email protected] (King Ngi Ngan) ∗∗ Corresponding
Preprint submitted to NEUROCOMPUTING
May 25, 2018
ACCEPTED MANUSCRIPT
are integrated into an end-to-end learning framework. To verify the effectiveness of the proposed method, extensive experiments are conducted on several datasets, and the experimental results demonstrate that the proposed method
CR IP T
achieves superior performance. Keywords: Deep learning, metric learning, semantic preserving, cross-modal hashing.
1. Introduction
AN US
Nowadays the dramatic growth of multimedia data from internet click and mobile device usage have posed a great challenge to data storage and information indexing. With benefits of high retrieval efficiency and low storage cost, hashing 5
technology has attracted growing attention during the past few years. Hashing methods aim to map data points to compact binary codes while preserving the similarity in the original space. The storage cost can be reduced dramatically
M
with the binary representation, and the const or sub-linear search speed can be achieved with the fast Hamming distances computation (bit-wise XOR operator 10
and bit-count operations).
ED
Most learning to hash methods focus on unimodal hashing which requires queries and database entries to have a homogeneous feature space [1, 2, 3, 4, 5,
PT
6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. However, in real-world scenarios, data are often composed of multiple modalities. For example, an image on a website is usually 15
surrounded by some tags, textual descriptions and other contents. Cross-modal
CE
retrieval aims to query a data in one modality and retrieve relevant data in another modality. To improve the retrieval efficiency and speed, several cross-
AC
modal hashing methods have been widely explored.
20
Existing cross-modal hashing methods can be roughly divided into two main
sub-categories: unsupervised methods and supervised methods. The unsupervised methods only utilize the original multi-modal features to learn hash functions, and the representative methods include CHMIS [16], SM2 H [17], LSSH [18], CMFH [19], ACQ [20], STMH [21], MGCMH [22], CCQ [23] and CV-DMH
2
ACCEPTED MANUSCRIPT
[24]. The basic idea of these unsupervised cross-modal hashing methods is to dis25
cover the latent correlation of different modalities. The widely used techniques include canonical correlation analysis [20], manifold learning [16, 17, 21, 22, 24],
CR IP T
dictionary learning [17, 18], matrix factorization [18, 19, 21, 24], vector quantization [23]. Due to lack of support of manual annotations, the unsupervised cross-modal hashing methods usually fail to achieve satisfactory performance.
For supervised cross-modal hashing methods, the semantic labels are in-
30
corporated into learning hash functions to improve the cross-modal retrieval
performance. The representative supervised cross-modal hashing methods in-
AN US
clude: 1) traditional cross-modal hashing methods such as CH [25], SCM [26], QCH [27], SMFH [28], MDBE [29], SePH [30], RoPH [31], DCH [32], LRSH [33], 35
CSDH [34]; 2) deep cross-modal hashing methods such as DCMH [35], PRDH [36], DVSH [37], CDQ [38], CHN [39], DSH [40], TDH [41]. The supervised cross-modal hashing methods incorporate the manual annotations in the loss function to model the semantic correlations across different modalities.
40
M
For traditional supervised cross-modal hashing methods, the entire learning procedure can be separated into two independent steps, i.e., the feature
ED
extraction and the hash-code learning. Although these methods can achieve promising retrieval performance, they usually fail to generate optimal hash codes. The reasons can be two-fold: 1) their feature extraction procedure is
45
PT
independent of the hash-code learning procedure; 2) the hash functions to be learned are usually linear which cannot effectively preserve the structure of the
CE
learned hash codes. On the contrary, the deep cross-modal hashing methods integrate the feature learning procedure and hash-code learning procedure into an end-to-end learning framework with deep neural networks. In addition, the
AC
nonlinear activations for deep neural networks in theory can be helpful to fit
50
arbitrary complex functions. Generally speaking, the end-to-end deep learning architecture for cross-modal hashing consists of two neural networks with one for each modality. The image modality usually utilizes the convolutional neural network (CNN) [35, 36, 37, 38] and the text modality usually adopts the multilayer perceptron (MLP) [35, 36, 38] or recurrent neural network (RNN) [37]. 3
ACCEPTED MANUSCRIPT
55
This end-to-end deep learning architecture can map the images and texts into a common Hamming space in which the cross-modal retrieval is conducted. Existing deep cross-modal hashing methods treat semantic similar instances equally
CR IP T
while ignore the different similarity levels between them in the case of multiple labels, where the similarity level is relevant to the number of common labels 60
shared by pairwise similar instances. Consequently, the above deep cross-modal
hashing methods cannot well deal with the multilevel measure (such as very similar, normally similar and dissimilar) [42] based multi-label cross-modal retrieval task. In addition, existing deep cross-modal hashing methods can only
65
AN US
capture the local semantic structure while ignore the global semantic structure. Obviously, the local semantic structure is insufficient to represent the semantic structure of cross-modal data.
To enhance the discriminative ability while preserving the global multilevel semantic structure of hash codes, we propose a global and local semanticspreserving (GLSP) based deep hashing method for cross-modal retrieval. The framework of the proposed method is shown in Fig. 1. More specifically, we
M
70
utilize a metric learning method based on Hamming distance to preserve a
ED
large margin between similar data points and dissimilar data points across different modalities. Through the metric learning method, we can preserve the local semantic structure into the hash codes across different modalities. In addition, in order to preserve the global multilevel semantic structure in hash
PT
75
codes, we transform the similarity levels into a probability distribution and
CE
utilize this probability distribution to supervise the intra-modal hash learning. Subsequently, in order to further enhance the performance of cross-modal retrieval, a consistent regularization term is introduced to guarantee that the resulting hash codes are as consistent as possible across different modalities.
AC
80
Finally, we integrate the local inter-modal semantic structure preserving term, the global intra-modal semantic structure preserving term and the consistent regularization term into an end-to-end learning framework. The entire deep network model is jointly trained via an efficient optimization algorithm. Exten-
85
sive experiments conducted on various cross-modal datasets and comparisons 4
ACCEPTED MANUSCRIPT
Local Semantic Structure Preserving
Input
ʹʹͶ
ͳͳ
ͳͳ
ʹʹͶ
ͷͶ
ͷͶ
conv2
ͷ ͷ
ʹ ʹ
conv3
͵ ͵
Conv 5h5 /1 ͵ 11h11 /4 Pooling Pooling 3×3 /2 3×3 /2
Conv Ͷ
Ͷ
ʹ ʹ
Conv 3h3 /1 Pooling 3×3 /2
͵ ͵
conv4 ͳ͵
ͳ͵
͵ ͵
ʹͷ Conv ʹͷ 3h3 /1 Pooling 3×3 /2
1) a yellow building with white columns in the background; 2) two palm trees in front of the house; 3) cars are parking in front of the house; 4) a woman and a child are walking over the square;
1
µ1−τ1
1
µ1+τ1
-1
ʹͷ Conv 3h3 /1 Pooling 3×3 /2
L 4096
4096
Input
fc1
...... bow
2τ1
fcb
fcb
word2vec
Consistent Regularization
fc7
......
L 8192
Text Feature Extraction
-1
similar cross-modal data points
dissimilar cross-modal data points
Global Semantic Structure Preserving
AN US
1) the dark outline of trees and bushes in the foreground; 2) the setting sun and an orange sky in the background;
fc6
conv5
CR IP T
conv1
Figure 1: Overview of the proposed global and local semantics-preserving based deep hashing method for cross-modal retrieval. The entire network architecture constitutes two encoders including an image encoder and a text encoder. We integrate the local semantic structure preserving (metric learning), global semantic structure preserving and a consistent regularization term into an end-to-end learning framework to improve the discriminative ability of hash
M
codes while preserving the global multilevel semantic structure. In addition, we use different shapes and colors to represent dissimilar data points and data points from different modalities respectively. P, Qx , and Qy are the joint probability distribution from the ground-truth more details.
ED
labels, image hash codes, and text hash codes respectively. Please refer to the text part for
PT
with state-of-the-art approaches demonstrate the effectiveness of the proposed method.
The rest of this paper is organized as follows. We review related works
CE
in section 2. The proposed method is elaborated in section 3, including the
90
deep architecture, the local semantic similarity preserving for capturing intermodal correlations, the global semantic similarity preserving for capturing intra-
AC
modal correlations and the regularization loss. The optimization algorithm is introduced in section 4. Extensive experimental evaluations are provided in section 5. Finally, we draw a conclusion in section 6.
5
ACCEPTED MANUSCRIPT
95
2. Related Works Recently, a variety of cross-modal retrieval methods have been proposed, including real-valued cross-modal retrieval methods, quantization-based cross-
survey on cross-modal retrieval can be found in [43]. 100
2.1. Real-Valued Cross-Modal Retrieval Methods
CR IP T
modal retrieval methods and cross-modal hashing methods. A comprehensive
The key idea of real-valued cross-modal retrieval is to map multimedia data
into a common subspace where cross-modal multimedia retrieval can be per-
AN US
formed. Recently, many real-valued cross-modal retrieval methods have been proposed. For example, Rasiwasia et al. [44] learn a latent space between two 105
modalities with canonical correlation analysis (CCA). Deng et al. [45] learn the cross-modal embedding space based on discriminative dictionary learning with common label alignment. Liong et al. [46] design two neural networks to nonlinearly map the multimedia data from two different modalities into a shared
110
M
feature subspace. However, these methods often require substantial memory resources for cross-modal retrieval and the low query efficiency limits their scal-
ED
ability. Therefore, they are not suitable for large-scale search. 2.2. Quantization-Based Cross-Modal Retrieval Methods
PT
Recently, quantization techniques [47, 48, 49] use some quantizers to approximate original features and obtain good results for single-modal similarity retrieval task [50, 51, 52]. Consequently, some quantization based cross-modal
CE
115
hashing methods have been proposed. For example, Irie [20], Xie [22], Liu [25], Wu [27] and Jiang [35] learn hash functions and the hash codes via minimizing
AC
the binary quantization error. Long et al. [23] propose a composite correla-
tion quantization model to preserve both intra-modal similarity and inter-modal
120
correlation. Zhang et al. [53] learn the quantizers for both modalities jointly through aligning the quantized representations for each pair of image and text in the same article. Yang [54] and Cao [38] propose deep quantization methods for cross-modal similarity retrieval, which adopt additive quantization [48] 6
ACCEPTED MANUSCRIPT
with label alignment and composite quantization [49] respectively to learn the 125
quantizers for both modalities using two neural networks. However, these quantization based cross-modal hashing methods [23, 53, 54, 38] tend to be slightly
CR IP T
less efficient than Hamming search over binary codes [55]. 2.3. Cross-Modal Hashing Methods
Unsupervised cross-modal hashing methods [17, 19, 22, 56, 57] learn the hash 130
codes using the original multi-modal features without any supervision informa-
tion. For example, Wu et al. [17] learn sparse codesets by jointly optimizing the
AN US
multi-modal dictionary, sparse representations and hyper-graph based manifold regularization. Ding et al. [19] employ collective matrix factorization to lean unified binary codes of different modalities. Xie et al. [22] learn unified hash 135
codes by constructing multiple homogeneous manifolds, one for each modality. Liu et al. [56] propose a query-adaptive rank fusion method over multiple tables with multiple views. In real applications, unsupervised cross-modal hashing
M
methods can hardly achieve satisfactory retrieval performance.
Unlike unsupervised cross-modal hashing methods, supervised cross-modal hashing methods utilize supervised information to facilitate hash function learn-
ED
140
ing or hash-code learning. The supervised information for modeling semantic correlations in existing cross-modal hashing can be given in four different forms
PT
include point-wise supervision [29, 32], pair-wise supervision [25, 26, 27, 28, 33, 34, 35, 36, 37, 38, 39, 40], triplet-wise supervision [31] and structured supervision 145
[30].
CE
The point-wise supervision usually formulates the hash learning as a linear
classification problem. For example, Wang et al. [29] simultaneously learn hash
AC
codes for supervised classification and preserves category-specific features for hash codes. Xu et al. [32] learn the modal-specific hash functions via the ridge
150
penalty and generates unified hash codes for linear classification. Although the semantic labels are introduced, the semantic correlation between different instances is not fully exploited. Therefore, the semantic correlation will not be effectively reflected in the resulting hash codes. 7
ACCEPTED MANUSCRIPT
The pair-wise supervision usually formulates the hash learning as a pair155
wise fitting problem [26, 30, 31, 40], metric learning problem [28, 27, 37, 39] and pair-wise classification problem [33, 34, 35, 36, 38]. The pair-wise fitting
CR IP T
problem aims to fit the groundtruth similarities with the similarities of hash codes. For example, Zhang [26] and Liu [40] fit the semantic similarity matrix with the learned hash codes based on the minimum mean square error. The 160
metric learning problem aims to preserve the distance between similar hash
codes close and dissimilar hash codes far away in the Hamming space, where different distance metrics can be used such as Hamming distance, Euclidean
AN US
distance, cosine distance and so on. For example, Liu et al. [28] preserve the semantic similarities of hash codes based on Euclidean distance. Wu [27] 165
and Cao [37, 39] utilize the cosine distance between hash codes to measure the similarity in the Hamming space. The pair-wise classification problem aims to minimize the pair-wise classification error. For example, Liu et al. [34] learn hash functions by minimizing the weighted classification error for each bit with
170
M
a boosting strategy. Jiang [35], Yang [36], and Cao [38] maximize cross-modal correlations by optimizing the likelihood function based on pairwise constraints.
ED
The pair-wise supervised methods usually consider the local semantic structure, since they mainly focus on modeling the relationship between individual training instances. Therefore the global semantic structure such as the ranking relation
175
PT
and the distribution of the semantic relationships between training instances are ignored.
CE
The triplet-wise supervision [31, 41] usually formulates the hash learning as a ranking preserving problem. For example, Ding et al. [31] formulate the problem of triplet ranking in Hamming space as the binary regression problem.
AC
Structured supervision [30] formulates the hash learning as the preservation of
180
the structure of semantic similarity distribution. For example, Lin et al. [30] transform the semantic affinities into a probability distribution, and fits it with another one derived from pair-wise Hamming distances by minimizing their Kullback-Leibler divergence. Compared to the triplet supervision, structured supervision can sufficiently capture the distribution of the semantic relation8
ACCEPTED MANUSCRIPT
185
ships between training instances by introducing the global semantic structure. However, the distribution of the semantic relationships between training instances cannot completely reveal the local discriminating semantic structure of
CR IP T
the multimedia data. In this paper, we incorporate the local semantic structure and the global se190
mantic structure into the cross-modal hash learning. In addition, inspired by the
success of quantization-based hashing methods, we introduce a consistent quantization loss to generate unified hash codes for both modalities of any instance. Finally, the local inter-modal semantic structure preserving term, the global
195
AN US
intra-modal semantic structure preserving term and the consistent regularization term are integrated into an end-to-end learning framework. Experimental results demonstrate the promising effectiveness and efficiency of the proposed GLSP.
M
3. Proposed Method
3.1. Notation and Problem Formulation In this paper, the bold uppercase letter like S and bold lowercase letter like
ED
200
z are used to denote matrix and vector. Sij denotes the element at the position (i, j) of matrix S. Si∗ denotes the elements of matrix S at the i-th row, and S∗j
PT
denotes the elements of matrix S at the j-th column. Moreover, kSkF is used to denote the Frobenius norm of matrix S. sign(·) denotes an element-wise sign 205
function where sign(x) = 1 if x ≥ 0 and sign(x) = −1 otherwise.
CE
Assume that there are n pairs of image-text instances O = {oi }ni=1 , and
each instance oi = (xi , yi ) has a class label vector zi = [zi1 , zi2 , · · · , zil ] ∈ Rc×1 ,
AC
where the set X = {xi }ni=1 denotes the image modality, Y = {yi }ni=1 denotes
the text modality, and c is the number of categories. zij is equal to 1 if the j-th
210
label is relevant to the instance oi and 0 otherwise. In addition, we also define a similarity matrix S ∈ {0, 1}n×n of training instances where Sij = 1 if the
instances oi and oj are similar, and Sij = 0 otherwise. Here, the definition of similar or dissimilar instances is based on their class labels {zi }ni=1 . Specifically, 9
ACCEPTED MANUSCRIPT
if two instances oi and oj share at least one label, we can say that they are 215
similar. Otherwise, oi and oj are dissimilar. The similarity matrix S also builds the neighbor graph between two instances in the semantic space. In other words,
neighbors of oi .
CR IP T
oj is the k-nearest neighbors of oi if Sij = 1. Otherwise, oj is not the k-nearest The goal of the proposed method is to learn an image hash function hx (x) ∈
{−1, 1}L and a text hash function hy (y) ∈ {−1, 1}L , where L is the length of (x)
(y)
= h(x) (xi ) and bi
hash code. The binary codes bi
= h(y) (yi ) can be gen-
erated for query and database instances with the corresponding hash function. (x)
(x)
(y)
dH (bi , bj ) =
(y)
and bj
is defined as follows:
AN US
The Hamming distance between bi
1 (x) L 1 (x) T (y) (y) kb − bj k2F = − bi bj 4 i 2 2
(1)
The proposed cross-modal hashing method is based on two assumptions: 1) if inter-modal hash codes are discriminative enough, similar hash codes and dissimilar hash codes from an inter-modal view will be separated by a large
M
margin; 2) if intra-modal hash codes can preserve the structure of the semantic space well, the difference between the probability distribution in the semantic
ED
space and the probability distribution of each modality in the Hamming space will be small. Based on the above assumptions, the entire objective function
PT
can be given as follows:
J = J xy + αJ xx + βJ yy + γR
(2)
CE
where J xy is the local semantic structure preserving term for capturing inter-
220
modal correlations, J xx and J yy are the global semantic structure preserving
term for capturing intra-modal correlations, and R is a regularization term to
AC
generate the unified hash codes for both modalities. α, β and γ are tradeoff hyperparameters to balance the impacts of each term. 3.2. Deep Architecture
225
The deep architecture contains two encoders, i.e., an image encoder and a text encoder, which are same as the ones in [35]. The image encoder consists of 10
ACCEPTED MANUSCRIPT
eight layers the VGG-F [58] network where the first seven layers are from the VGG-F model including conv1 − conv5 and f c6 − f c7, the eight layer is a fullyconnected layer of L dimension. The text encoder is a multilayer perceptrons (MLP) which consists of two fully-connected layers with the output dimensions
CR IP T
230
of 8192 and L respectively. The activation functions for the first layer and the
second layer are ReLU and the identity function respectively. Through these
two deep neural networks, we can embed image and text feature into a common space.
3.3. Local Semantic Structure Preserving for Capturing Inter-Modal Correla-
AN US
235
tions
Local semantic structure preserving focuses on modeling the neighborhood relationship between individual data points. In other words, it refers to how to model the similarity relations in the Hamming space with the neighborhood graph S. Specifically, we expect the neighbor image-text pairs to have small
M
Hamming distances, while the non-neighbor image-text pairs to have large Hamming distances. Inspired by the successful application of metric learning in face
ED
verification [59], object recognition [60], image set classification [61], cross-modal matching [46], we would like to introduce the metric learning method into crossmodal hashing. However, most metric learning methods use Euclidean distance
PT
or cosine distance to measure the similarities between different data points. We will extend the metric learning method to Hamming space and reformulate the local semantic structure preserving as a metric learning problem. Specifically, (x)
CE
for each hash code bi
(the green circle in Fig. 1) in the image modality, we (y)
expect the hash code of similar instance bj
(the blue circle in Fig. 1) in the
AC
text modality to fall within a given Hamming radius µ1 − τ1 (µ1 > τ1 > 0) of (x)
(y)
bi , and the hash code of dissimilar instance bk
(the blue rectangle in Fig.
1) in the text modality to fall out of a given Hamming distance radius µ1 + τ1 . Therefore, a large margin 2τ1 will be generated between dissimilar instances
from the same modality and a large margin µ1 + τ1 will be generated between dissimilar instances from the different modalities as shown in Fig. 1. We define 11
ACCEPTED MANUSCRIPT
the following loss for penalizing the violation of the distance constraints above:
=
δ(dH (bi(x) , b(y) j ) ≥ µ1 − τ1 ), if Sij = 1
(x)
(y)
(3)
δ(dH (bi , bj ) ≤ µ1 + τ1 ), otherwise.
CR IP T
(x) (y) ϕ(bi , bj , Sij )
where δ(·) is an indicator function which returns 1 if the condition within the
brackets is satisfied and 0 otherwise. In Eqn. (3), a const penalty is given for
the pairwise samples which violate the distance constraints. Then the proposed large margin inter-modal hashing loss is defined as: =
n X n X
(x)
(y)
ϕ(bi , bj , Sij ).
i=1 j=1
(4)
AN US
J
xy
3.4. Global Semantic Structure Preserving for Capturing Intra-Modal Correlations
Global semantic structure preserving focuses on the preservation of the global semantic distribution. Meanwhile, the global semantic structure should
M
consider the multilevel similarity information between instances. In addition, the multi-modal instances are usually associated with multiple labels. To pre-
ED
serve such multilevel similarity between hash codes, [42] utilizes an adaptive weighted triplet loss. However, such a triplet loss cannot make full use of the global multilevel similarity information. In contrast, we utilize the entire seman-
PT
tic affinity matrix [30] of training instances as supervised information to learn the global semantic preserving hashing codes for each modality. Specifically, we define the semantic affinity matrix A ∈ Rn×n as the number of common labels
AC
CE
between two data points. The element Aij is given as follows: Aij = zTi zj .
(5)
Therefore, we can compute a joint probability distribution P in the semantic
space from the semantic affinity matrix A. The element pij is derived as follows: pij = Pn
i=1
A Pnij
j=1,j6=i
12
Aij
.
(6)
ACCEPTED MANUSCRIPT
240
Because we are only interested in modeling pairwise similarities, the value of pii Pn P is set to zero. Thus i=1 j=1,j6=i pij = 1.
In the low-dimensional Hamming space, we can utilize the Hamming distance
CR IP T
to derive the probability distributions Qx and Qy for the image modality and the
x text modality respectively. For example, we define qij as the joint probability
distribution for bxi and bxj in the image modality as follows: x
x
e−dH (bi ,bj ) . Pn x −dH (bx k ,bm ) k=1 m=1,m6=k e
x qij = Pn
(7)
x The value of qij determines how similar the two instances bxi and bxj are. Again,
AN US
x since we are only interested in modeling pairwise similarities, the value of qii is y set to zero. Similarly, we can define qij as follows: y
y
e−dH (bi ,bj ) . Pn y −dH (by k ,bm ) k=1 m=1,m6=k e
y qij = Pn
(8)
y where the value of qii is set to zero. In order to preserve the global multilevel
semantic affinities in the resulting hash codes of each modality, we propose to
M
minimize the Kullback-Leibler divergence between the joint probability distribution P in the semantic space and the joint probability distribution Qx (Qy )
ED
in the Hamming space:
PT
J xx = KL(P k Qx ) = J yy = KL(P k Qy ) =
n n X X
pij log
pij x qij
(9)
n n X X
pij log
pij y qij
(10)
i=1 j=1,j6=i
i=1 j=1,j6=i
CE
3.5. Consistent Regularization Term Considering the semantic consistency across different modalities, the pro-
AC
posed method tends to generate unified hash codes for both modalities of any instance. Therefore, we introduce the following regularization term into the
245
objective function. R = kB − Bx k2F + kB − By k2F where B is the unified hash codes for both modalities. 13
(11)
ACCEPTED MANUSCRIPT
4. Optimization The objective function involves an indicator function and discrete variables, where these factors lead to a discontinuous and non-convex optimization prob-
CR IP T
lem. The optimization problem in Eqn. (2) is difficult to solve directly. There-
fore, we need to develop an efficient optimization algorithm to optimize the objective function. Firstly, we substitute a convex surrogate for the indicator
function in Eqn. (3). Specifically, we replace the indicator function δ(x ≥ 0)
with the logistic loss `(x) = log(1 + ex ). The main reason for choosing this surrogate is that this function is continuous, monotone increasing and differ-
AN US
entiable everywhere. In practical applications, we adopt its equivalent transformation `(x) = x − log(σ(x)) in order to avoid numerical overflow, where σ(x) = 1/(1 + e−x ). Therefore, we can approximate the Eqn. (3) as follows: n n X X i=1 j=1
n X n X i=1 j=1 n X n X
Sij δ(
L 1 (x) T (y) − µ1 + τ1 − bi bj ≥ 0)+ 2 2
(1 − Sij )δ(µ1 + τ1 −
M
J xy =
L 1 (x) T (y) + bi bj ≥ 0) 2 2 (12)
ED
1 (x) T (y) ≈ Sij `(µ + τ − bi bj )+ 2 i=1 j=1
PT
n X n X 1 (x) T (y) (1 − Sij )`(−µ + τ + bi bj ). 2 i=1 j=1
where µ = Bx∗i
and
By∗i
L 2
− µ1 and τ = τ1 . Secondly, we approximate the hash codes
with the real-valued output of the corresponding neural network.
CE
Specifically, we replace the hash codes Bx∗i and By∗i with f (xi ; θx ) and g(yi ; θy )
AC
y x respectively. Therefore, J xy in Eqn. (12), qij in Eqn. (7), qij in Eqn. (8), and
14
ACCEPTED MANUSCRIPT
R in Eqn. (11) can be rewritten as: J xy =
n X n X
1 Sij `(µ + τ − FT∗i G∗j )+ 2 i=1 j=1
T
1
x qij = Pn
k=1
e 2 F∗i F∗j Pn m=1,m6=k
1
T
e 2 F∗k F∗m
.
T
1
e 2 G∗i G∗j . Pn 1 T 2 G∗k G∗m k=1 m=1,m6=k e
y = Pn qij
AN US
R = kB − Fk2F + kB − Gk2F .
(13)
CR IP T
n X n X 1 (1 − Sij )`(−µ + τ + FT∗i G∗j ). 2 i=1 j=1
(14)
(15) (16)
The optimization problem of Eqn. (2) can be solved by alternating optimization of θx , θy and B. These parameters can be optimized one by one with other 250
parameters fixed. The entire alternating procedure is illustrated in Algorithm 1. The detailed step by step derivations are given below.
M
4.1. Update B with θx and θy fixed
ED
If we fix θx and θy , the optimization subproblem can be rewritten as follows: min kB − Fk2F + kB − Gk2F = −2Tr(BT V) + const B
(17)
PT
s.t. B ∈ {−1, 1}c×n
where V = F + G, and const = 2nc + kFk2F + kGk2F . We can update B with
CE
the sign of V as
B = sign(V) = sign(F + G).
(18)
AC
4.2. Update θx with θy and B fixed Since θy and B are fixed, we can utilize stochastic gradient descent (SGD)
to update the neural network parameter θx of the image modality with the back-propagation (BP) algorithm. More specifically, we randomly sample a mini-batch of points from the image modality, and derive the gradients of the objective function w.r.t. the real-valued output of the neural network for image
15
ACCEPTED MANUSCRIPT
CR IP T
Algorithm 1 GLSP algorithm Input: The image set X = {xi }ni=1 , text set Y = {yi }ni=1 , class labels {zi }ni=1 , cross-modal pairwise similarity matrix S, the parameters α, β, γ, µ, τ .
Output: Binary code matrix B, parameters θx and θy of the deep neural networks for the image modality and the text modality. Initialization:
Initialize network parameters θx and θy , mini-batch size Nx = Ny = 128,
1:
repeat
AN US
and the iteration number tx = dn/Nx e, ty = dn/Ny e; 2:
Update B according to Eqn. (18);
3:
for iter = 1, 2, . . . , tx do
4:
Randomly sample a mini-batch of Nx images from X;
5:
For each sampled image xi in the mini-batch, calculate the output
M
x f (xi ; θx ), update the matrix F∗i = f (xi ; θx ) and the probability qij
according to Eqn. (14);
Backpropagate the gradients according to Eqn. (19) and update the
ED
6:
network parameter θx . end for
8:
for iter = 1, 2, . . . , ty do
9: 10:
PT
7:
Randomly sample a mini-batch of Ny texts from Y;
For each sampled text yi in the mini-batch, calculate the output
CE
y g(yi ; θy ), update the matrix G∗i = g(yi ; θy ) and the probability qij
according to Eqn. (15);
AC
11:
12: 13:
Backpropagate the gradients according to Eqn. (20) and update the network parameter θy .
end for until convergence or reaching the maximum number of iterations.
16
ACCEPTED MANUSCRIPT
modality. Then we can compute
∂J ∂θx
with the chain rule. Finally, stochastic
gradient descent is performed to update the neural network parameter θx of the image modality. In particular, for each real-valued output F∗i of the neural ∂J ∂F∗i
∂J ∂Jxy ∂Jxx ∂R = +α +γ ∂F∗i ∂F∗i ∂F∗i ∂F∗i where
as follows:
CR IP T
network for image modality, we compute the gradient
(19)
n X ∂Jxy 1 =− Sij σ(µ + τ − FT∗i G∗j )G∗j + ∂F∗i 2 j=1 n X
AN US
1 (1 − Sij )σ(−µ + τ + FT∗i G∗j )G∗j 2 j=1
∂Jxx x = (qi∗ − pi∗ )FT ∂F∗i ∂R = 2(F∗i − B∗i ) ∂F∗i
(20) (21) (22)
M
4.3. Update θy with θx and B fixed
When θx and B are fixed, we use SGD to update the neural network pa-
255
ED
rameter θy of the text modality with the back-propagation (BP) algorithm. More specifically, for each real-valued output G∗i of the neural network for text
PT
modality, we compute the gradient
∂J ∂G∗i
as follows:
∂J ∂Jxy ∂Jxx ∂R = +α +γ ∂G∗i ∂G∗i ∂G∗i ∂G∗i
(23)
AC
CE
where
n X 1 ∂Jxy =− Sij σ(µ + τ − GT∗i F∗j )F∗j + ∂G∗i 2 j=1 n X
1 (1 − Sij )σ(−µ + τ + GT∗i F∗j )F∗j 2 j=1
∂Jxx y = (qi∗ − pi∗ )GT ∂G∗i ∂R = 2(G∗i − B∗i ) ∂G∗i
(24) (25) (26)
17
ACCEPTED MANUSCRIPT
4.4. Implementation Tricks In order to take full advantage of the whole training set, the real-valued
260
outputs of the neural networks for image modality and text modality are stored
CR IP T
in two matrices (i.e., F and G) respectively. In the training procedure, we
need to divide the whole training set into mini-batches. Then in each iteration, we randomly feed a mini-batch of points to the neural network and utilize the 265
real-valued outputs of the neural network to update the corresponding matrix. According to Algorithm 1, the corresponding parameters can be updated in each
iteration. It is worth noting that nT pairwise data points are incorporated into
AN US
the updating of the neural network in each iteration, where T is the batch size
and n is the number of training instances. Although it takes a slightly higher 270
computation and memory cost, the introduction of the whole training instances can boost the proposed method more robust against noise and outliers. y x The computation qij and qij in Eqn. (14) and Eqn. (15) involves all pairwise
data points in each iteration. Thus there are a large mount of repeated com-
275
M
y x putations in the computation of qij and qij in each iteration. For example, in
(t + 1)-th iteration, we randomly sample T images and feed them to the neural
ED
network of image modality. In addition, we use idx1 to denote the indices of the T training images and idx2 to denote the indices of the remaining training
PT
x images. The denominator dt+1 of qij is as follows:
X
AC
CE
dt+1 =
where
Ft+1 ∗m
=
X
1
t+1 T
e 2 F∗k
Ft+1 ∗m
+
k∈idx1 m∈idx1 ,m6=k
2
X
X
1
t+1 T
e 2 F∗k
Ft+1 ∗m
+
(27)
k∈idx1 m∈idx2
X
X
1
t+1 T
e 2 F∗k
Ft+1 ∗m
k∈idx2 m∈idx2 ,m6=k
Ft∗m ,
if m ∈ idx2 . Therefore, the third term in dt+1 is computed
repeatedly, since it has been computed in dt . In order to reduce the computation
18
ACCEPTED MANUSCRIPT
cost, we can iteratively compute dt+1 as follows: X
X
t
1
e 2 F∗k
T
Ft∗m
k∈idx1 m∈idx1 ,m6=k
X
2
X
1
t
e 2 F∗k
T
Ft∗m
+
k∈idx1 m∈idx2
X
X
1
t+1 T
e 2 F∗k
Ft+1 ∗m
k∈idx1 m∈idx1 ,m6=k
2
X
X
1
t+1 T
e 2 F∗k
k∈idx1 m∈idx2
−
CR IP T
dt+1 = dt −
Ft+1 ∗m
(28)
+
The computation cost can decrease from (n2 − n)/2 pairs to T 2 − T + 2nT pairs if the symmetric property of inner product is used.
AN US
280
5. Experiments
In this section, we conduct experiments of cross-modality search on two benchmarks to validate the effectiveness of the proposed method. Each of the
285
M
two benchmarks consists of two modalities: image and text. In the following, we firstly introduce the experiment settings including the datasets, the baseline
ED
methods and the evaluation criteria. Then, in order to make fair comparisons, we will give the presentation and discussion of experimental results. Finally, we further investigate the parameter sensitivity of the proposed method. All
290
PT
the following experiments are run on a server with Intel(R) Xeon(R) E5-2620
[email protected] CPU, 128GB RAM and a NVIDIA TITAN Xp GPU with 12GB
CE
memory.
5.1. Experiment Settings
AC
5.1.1. Datasets
295
The proposed method is evaluated on two benchmark multi-modal datasets,
i.e., MIRFlickr-25K [62] and IAPRTC-12 [63]. Some statistics of these two datasets are introduced as follows: • MIRFlickr-25K consists of 25,000 image-text pairs which are annotated with some of 24 concepts. Several textual tags are provided for each image. 19
ACCEPTED MANUSCRIPT
Following the settings in [35], we remove the textual tag if its frequency is less than 20 in the dataset. Subsequently, we get 20,015 image-text pairs
300
for experiment. The text modality is represented as a 1386-dimensional
CR IP T
bag-of-words feature. For the hand-crafted feature based method, we use a 4096-dimensional CNN feature vector to represent each image. The dataset is randomly split into a query set of 2,000 image-text pairs and
a database set of 18,015 image-text pairs. We randomly sample 10,000
305
image-text pairs from the database instances as training instances.
• IAPRTC-12 consists of 20,000 images and each image is connected to
AN US
several descriptive sentences. Each instance is annotated with 275 pre-
defined labels. The size of the vocabulary is 4670. Therefore we use a 4670-dimensional bag-of-words vector to represent the text for each in-
310
stance. For the hand-crafted feature based cross-modal hashing methods, we use a 4096-dimensional CNN feature vector to represent each image.
M
The entire dataset is exploited for our experiment. The dataset is randomly split into a query set of 2,000 image-text pairs and a database of 18,000 image-text pairs. We randomly sample 10,000 image-text pairs
315
ED
from the database instances as training instances. 5.1.2. Baseline Methods
PT
We compare the proposed method with the following state-of-the-art crossmodal hashing methods. • Semantic Correlation Maximization (SCM) [26] learns hash functions via
CE
320
maximizing the semantic correlation between the two modalities with re-
AC
spect to the semantic labels. In the following experiments, we select the
325
sequential learning since it usually performs better than the direct eigendecomposition. • Semantics-Preserving Hashing (SePH) [30] learns unified binary codes via the Kullback-Leibler divergence between a probability distribution constructed with the pairwise similarity matrix and a estimated one con20
ACCEPTED MANUSCRIPT
structed with to-be-learnt hash codes, and the hash functions can be learnt via RBF kernel logistic regression. • Rank-order Preserving Hashing (RoPH) [31] learns hash functions via in-
330
CR IP T
tegrating a rank-order preserving loss with a ridge regression loss.
• Linear Subspace Ranking Hashing (LSRH) [33] learns hash functions via maximizing the rank correlation between two ranking subspaces.
• Deep Cross-Modal Hashing (DCMH) [35] learns high non-linear mapping functions via integrating feature learning and hash-code learning into an
335
AN US
end-to-end learning framework.
• Pairwise Relationship Deep Hashing (PRDH) [36] learns hash functions via integrating different types of pairwise constraints and additional decorrelation constraints to preserve the similarities and enhance the discriminative
5.1.3. Evaluation Criteria
M
ability of the hash codes respectively.
340
The multi-modal instances are usually associated with multiple labels and
ED
their similarity level to the query can give the refined ranking quality. Therefore, in the following experiments, the three commonly-used ranking metrics including Normalized Discounted Cumulative Gain (NDCG) [31], Average Cu-
PT
345
mulative Gain (ACG) [31], and weighted mean Average Precision (mAPw ) [31] are used to evaluate the ranking quality. These three metrics are defined as
CE
follows:
(1) NDCG. Given a single query q and a ranking list of p retrieved instances,
the NDCG score is defined as follows:
AC
350
N DCG@p =
p 1 X 2ri − 1 Z i=1 log(i + 1)
(29)
where ri is the similarity level of the i-th retrieved instance in the ranking list, i.e., the number of common labels shared between the query and the ith retrieved instance. Z is a normalization factor to ensure that the NDCG 21
ACCEPTED MANUSCRIPT
score equals to one for the correct ranking. The NDCG scores of all queries are 355
averaged in our evaluation. (2) ACG. The definition of ACG score is given as follows: 1X ri . p i=1
CR IP T
p
ACG@p =
(30)
Similarly the ACG scores of all queries are also averaged to evaluate the ranking quality.
(3) The weighted MAP. The definition of the weighted MAP is given as follows:
Q
APω (q) =
1 X APω (q) Q q=1 1
with
AN US
M APω @p =
pr>0
p X
(31)
δ(rt > 0)ACG@t,
t=1
where Q is the number of queries and pr>0 is the number of relevant data points in the ranking list.
Besides the three ranking metrics mentioned above, Mean average precision
M
360
(MAP) [7] in the Hamming ranking protocol and retrieval precisions within
ED
Hamming radius 2 (PH2) [7] in the hash lookup protocol of all the baselines are also reported in the following experiments.
PT
5.1.4. Implementation Details
Following the settings in [35], the first seven layers of the CNN for image
365
modality is initialized with the VGG-F [58] model which is pre-trained on Ima-
CE
geNet dataset [64]. In addition, we randomly initialize all the other parameters of the deep architecture. For both datasets, the mini-batch size and the number
AC
of outer-loop in Algorithm 1 are fixed to 128 and 500 respectively. We decrease
370
the learning rate from 10−1.5 to 10−3 evenly in the log space for each epoch (outer-loop). In the following experiments, µ and γ are empirically set to 0 and
1 respectively for all datasets. Subsequently, we will investigate the impact of α, β and τ on algorithm performance. For the comparing methods, experimental parameters are set according to the suggestions provided in the original papers.
22
ACCEPTED MANUSCRIPT
Table 1: The comparison of NDCG@100, ACG@100 and MAPw @500 on MIRFlickr-25K dataset from 16 bits to 128 bits. MIRFlickr-25K Method
Text query
32 bits
64 bits
128 bits
NDCG@100
ACG@100
MAPw @500
NDCG@100
ACG@100
MAPw @500
NDCG@100
ACG@100
MAPw @500
NDCG@100
ACG@100
MAPw @500
LSRH [33]
0.2326
1.2923
1.3152
0.2292
1.2543
1.2876
0.2493
1.3855
1.3966
0.2522
1.3621
1.3656
1.5431
0.3024
1.5709
1.5587
1.7825
0.3725
1.8399
1.8107
1.8174
0.3743
1.8668
1.8367
1.6823
0.3412
1.7322
1.7187
1.7223
0.3644
1.7973
1.7791
2.0729
0.4346
2.1219
2.0848
1.5915
0.3209
1.7240
1.7095
1.6095
0.3225
1.6419
1.6172
1.7519
0.3683
1.7875
1.7599
1.7121
0.3604
1.7562
1.7385
1.7660
0.3623
1.8274
1.8313
1.8352
0.3807
1.8958
1.8838
2.0307
0.4467
2.1309
2.0897
SCM [26]
0.2863
1.5127
1.4958
0.2984
1.5513
1.5350
0.2997
1.5552
SePH [30]
0.3287
1.6865
1.6728
0.3532
1.7686
1.7431
0.3643
1.8104
RoPH [31]
0.3218
1.6800
1.6625
0.3495
1.7790
1.7583
0.3666
1.8450
DCMH [35]
0.3060
1.6103
1.5908
0.3235
1.6612
1.6468
0.3342
1.7006
PRDH [36]
0.3182
1.6341
1.6203
0.3305
1.6827
1.6696
0.3472
1.7400
GLSP
0.3926
1.9353
1.9029
0.4099
2.0279
2.0026
0.4336
2.1180
LSRH [33]
0.2626
1.4484
1.4538
0.2735
1.4717
1.4720
0.2965
1.5910
SCM [26]
0.3012
1.5797
1.5529
0.3131
1.6062
1.5833
0.3205
1.6330
SePH [30]
0.3336
1.6565
1.6422
0.3553
1.7302
1.7106
0.3621
1.7786
RoPH [31]
0.3221
1.6125
1.6090
0.3469
1.7177
1.7015
0.3598
1.7265
DCMH [35]
0.3395
1.7303
1.7270
0.3415
1.7284
1.7420
0.3488
1.7604
PRDH [36]
0.3447
1.7624
1.7586
0.3538
1.7824
1.7738
0.3665
1.8441
GLSP
0.3527
1.8474
1.8539
0.4105
2.0205
1.9909
0.4219
2.0555
CR IP T
Image query
16 bits
AN US
Type
Table 2: The comparison of NDCG@100, ACG@100 and MAPw @500 on IAPRTC-12 dataset from 16 bits to 128 bits.
IAPRTC-12
Method
Text query
MAPw @500
NDCG@100
LSRH [33]
0.1155
0.6251
0.6332
0.1358
SCM [26]
0.0911
0.5195
0.5176
0.1364
SePH [30]
0.2137
0.9571
0.9365
0.2458
RoPH [31]
0.2046
0.9326
0.9136
0.2412
DCMH [35]
0.2008
0.9424
0.9127
0.2347
PRDH [36]
0.2031
0.9496
0.9391
0.2180
GLSP
0.2363
1.0983
1.0534
0.2699
LSRH [33]
0.1215
0.6358
0.6419
SCM [26]
0.1202
0.6091
0.5995
SePH [30]
0.2147
0.9725
0.9505
RoPH [31]
0.2107
0.9385
0.9151
0.8786
0.8494
64 bits
128 bits
ACG@100
MAPw @500
NDCG@100
ACG@100
MAPw @500
NDCG@100
ACG@100
MAPw @500
0.7146
0.7112
0.1385
0.7276
0.7206
0.1426
0.7462
0.7455
0.6627
0.6495
0.1401
0.6867
0.6654
0.1460
0.7025
0.6806
1.0603
1.0269
0.2630
1.0976
1.0550
0.2794
1.1383
1.0924
1.0344
1.0042
0.2572
1.1052
1.0703
0.2680
1.1360
1.0967
1.0717
1.0354
0.2606
1.1583
1.1174
0.2137
1.0118
0.9987
1.0035
0.9849
0.2451
1.0872
1.0585
0.2556
1.1215
1.0899
1.1644
1.1180
0.3115
1.2953
1.2359
0.3304
1.3555
1.2906
0.1452
0.7471
0.7406
0.1489
0.7561
0.7487
0.1575
0.7986
0.7886
0.1607
0.7530
0.7246
0.1643
0.7595
0.7309
0.1752
0.8002
0.7660
0.2412
1.0486
1.0178
0.2750
1.1349
1.0847
0.2930
1.1793
1.1222
0.2369
1.0371
1.0068
0.2657
1.1085
1.0714
0.2803
1.1543
1.1119
DCMH [35]
0.1811
0.9668
0.9451
0.2547
1.1254
1.0892
0.2655
1.1638
PRDH [36]
0.2447
1.0916
1.0598
0.2658
0.2068
1.1604
1.1237
0.2864
1.2203
1.1798
0.3002
1.2652
1.2190
GLSP
0.2104
0.9345
0.9173
0.2435
1.0611
1.0251
0.2901
1.2311
1.1804
0.3134
1.2901
1.2376
1.1304
We carefully implement PRDH on MatConvNet using the same network model
PT
375
32 bits
ACG@100
M
Image query
16 bits NDCG@100
ED
Type
in this paper. The parameters are carefully tuned and set according to the
CE
original paper.
5.2. Results and Analyses
AC
5.2.1. Results on MIRFlickr-25K
380
The experimental results including NDCG@100, ACG@100 and MAPw @500
of the proposed method and other methods on MIRFlickr-25K dataset are presented in Table 1. We can see that the proposed method outperforms the baselines, including non-deep supervised cross-modal hashing methods with the CNN features (SCM [26], SePH [30], LSRH [33] and RoPH [31]) and the deep
23
ACCEPTED MANUSCRIPT
Table 3: The comparison of MAP on MIRFlickr-25K and IAPRTC-12 datasets from 16 bits to 128 bits. MIRFlickr-25K
IAPRTC-12
Method 16 bits
32 bits
64 bits
128 bits
16 bits
LSRH [33]
0.6611
0.6716
0.6822
0.6909
0.3977
SCM [26]
0.6798
0.6890
0.6962
0.6998
0.3640
SePH [30]
0.6977
0.6782
0.6851
0.6895
0.4782
RoPH [31]
0.7038
0.7139
0.7230
0.7264
0.4638
DCMH [35]
0.7399
0.7519
0.7555
0.7644
0.4773
Image query
0.7344
0.7495
0.7551
0.7659
0.4831
GLSP
0.7464
0.7706
0.7927
0.7989
0.5040
LSRH [33]
0.6809
0.6859
0.6997
0.7121
0.4201
SCM [26]
0.6737
0.6824
0.6820
0.6854
0.3452
0.6966
0.7033
0.7373
0.7392
0.7990
0.7166
0.6912
0.7180
0.7298
DCMH [35]
0.7832
0.7980
PRDH [36]
0.7834
0.7941
GLSP
0.7538
0.7766
128 bits
0.4274
0.4386
0.4059
0.4084
0.4197
0.4933
0.4636
0.4732
0.4899
0.5076
0.5166
0.5039
0.5276
0.5437
0.5101
0.5294
0.5443
0.5360
0.5764
0.6017
0.4357
0.4483
0.4591
0.3587
0.3580
0.3632
0.4821
0.4995
0.4693
0.4779
0.4668
0.4951
0.5139
0.5239
0.8032
0.5306
0.5535
0.5804
0.5913
0.8006
0.8058
0.5359
0.5675
0.5870
0.5981
0.8026
0.8046
0.4825
0.5198
0.5653
0.5936
M
SePH [30] RoPH [31]
Text query
64 bits
0.4210
AN US
PRDH [36]
32 bits
CR IP T
Type
Table 4: The comparison of PH2 on MIRFlickr-25K and IAPRTC-12 datasets from 16
ED
bits to 128 bits.
MIRFlickr-25K
Type
32 bits
64 bits
128 bits
16 bits
32 bits
64 bits
128 bits
LSRH [33]
0.7649
0.8394
0.8858
0.9256
0.4940
0.5715
0.6398
0.7155
SCM [26]
0.7230
0.7848
0.8112
0.8329
0.3706
0.4989
0.5318
0.7567
SePH [30]
0.7991
0.9377
0.9692
0.9792
0.6605
0.8053
0.9416
0.9895
RoPH [31]
0.7948
0.8504
0.8686
0.9567
0.6371
0.8134
0.9830
0.9983
DCMH [35]
0.8531
0.9114
0.9487
0.9302
0.8531
0.9114
0.9487
0.9302
PRDH [36]
0.8641
0.9072
0.9661
1.0000
0.8641
0.9072
0.9661
1.0000
PT
16 bits
AC
CE
Image query
Text query
IAPRTC-12
Method
GLSP
0.9577
0.9496
1.0000
0.0000
0.9230
1.0000
0.0000
0.0000
LSRH [33]
0.7523
0.8344
0.8972
0.9839
0.4863
0.5678
0.6029
0.5653
SCM [26]
0.7570
0.8004
0.8515
0.9142
0.3769
0.4643
0.4673
0.3460
SePH [30]
0.8130
0.9320
0.9511
0.9611
0.6680
0.8112
0.9422
0.9927
RoPH [31]
0.7876
0.8469
0.8818
0.9443
0.6486
0.8297
0.9756
0.9993
DCMH [35]
0.8373
0.8946
0.9521
1.0000
0.8373
0.8946
0.9521
1.0000
PRDH [36]
0.8665
0.9043
0.9727
1.0000
0.8665
0.9043
0.9727
1.0000
GLSP
0.9755
0.9678
0.0000
0.0000
0.9358
0.0000
0.0000
0.0000
24
ACCEPTED MANUSCRIPT
385
supervised cross-modal hashing methods (DCMH [35] and PRDH [36]). The improvement can be attributed to the following reasons: 1) the introduction of metric learning can generate a large margin between similar and dissimilar hash
CR IP T
codes; 2) the introduction of supervised information with multilevel similarity can guide to preserving the global multilevel semantic structure into hash codes; 390
3) the end-to-end learning framework can enhance the feedback between the
hashing learning procedure and the feature learning procedure. SePH and RoPH perform better than SCM, LSRH, and DCMH in most case. The reason can be that SePH and RoPH can learn unified binary codes for different modalities
AN US
which can enhance the correlations across different modalities.
The MAP performance of the proposed methods and other methods on
395
MIRFlickr-25K dataset are presented in Table 3. For image query, the proposed method can achieve higher MAP performance than other baselines. For text query, the proposed method can achieve competitive MAP performance with DCMH and PRDH when using more bits (e.g., 64-bit and 128-bit). The PH2 performance of the proposed methods and other baselines on MIRFlickr-
M
400
25K dataset are presented in Table 4. For image query and text query, the
ED
proposed method performs better than other baselines from 16 bits to 64 bits and 16 bits to 32 bits respectively. Then, the precision values of the proposed method decrease sharply with the increasing number of hashing bits. The reason is that the Hamming space becomes increasing sparse when using longer
PT
405
codes and few data points fall within the Hamming ball with radius 2.
CE
5.2.2. Results on IAPRTC-12 The experimental results including NDCG@100, ACG@100 and MAPw @500
AC
of the proposed method and other methods on IAPRTC-12 dataset are pre-
410
sented in Table 2. For image query, we can see that the proposed method outperforms the baselines under different ranking metrics. For text query, PRDH performs better than the proposed method and the other baselines from 16 bits to 32 bits. It can be attributed to the introduction of the intra-modal similarity preservation and the intra-modal decorrelation constraints. In addition, 25
ACCEPTED MANUSCRIPT
415
SePH performs better than the proposed method and the other baselines except PRDH at 16 bits. The reason can be attributed to the kernel method for the text feature extraction. Moreover, SePH performs better than the other baselines
CR IP T
for most cases. The reason can be that SePH incorporates the global multilevel similarity into the resulting hash codes. From another point of view, it implies 420
that the introduction of multilevel similarity can improve the performance of cross-modal hashing for multi-label retrieval.
The MAP performance of the proposed methods and other methods on
IAPRTC-12 dataset are presented in Table 3. For image query, the pro-
425
AN US
posed method can achieve higher MAP performance than other baselines. For
text query, the proposed method can achieve competitive MAP performance with DCMH and PRDH when using more bits (e.g., 128-bit). The PH2 performance of the proposed methods and other baselines on IAPRTC-12 dataset are presented in Table 4. For image query and text query, the proposed method performs better than other baselines from 16 bits to 32 bits and 16 bits respectively. Then, the precision values of the proposed method decrease sharply with
M
430
the increasing number of hashing bits.
ED
5.3. Effect Analysis of Each Component The proposed method mainly consists of three components: 1) metric learn-
435
PT
ing for capturing inter-modal correlations; 2) semantic preserving hash learning for capturing intra-modal correlations; 3) consistent regularization constraint. In this subsection, we will investigate the effect of each component. Therefore,
CE
we define the following three alternative baselines: 1. GLSP-1: training the network model without margin, i.e., τ = 0.
AC
2. GLSP-2: training the network model without the global semantic struc-
440
ture preserving, i.e., J xy + γR. 3. GLSP-3: training the network model without consistent regularization constraint, i.e., J xy + αJ xx + βJ yy . The experimental results of these baselines on MIRFlickr-25K and IAPRTC-
12 datasets are shown in Table 5 and Table 6 respectively. We can see that 26
ACCEPTED MANUSCRIPT
Table 5: The effect of different component on MIRFlickr-25K dataset from 16 bits to 128 bits. MIRFlickr-25K Method
16 bits
32 bits
64 bits
NDCG@100
ACG@100
MAPw @500
NDCG@100
ACG@100
MAPw @500
NDCG@100
ACG@100
GLSP-1
0.3674
1.8908
1.8555
0.3744
1.9370
1.8902
0.3874
1.9402
GLSP-2
0.3563
1.8045
1.7697
0.3999
1.9969
1.9724
0.4232
2.0987
GLSP-3
0.3558
1.7526
1.7078
0.3487
1.7660
1.7565
0.3907
1.9677
GLSP
0.3926
1.9353
1.9029
0.4099
2.0279
2.0026
0.4336
2.1180
GLSP-1
0.3299
1.5374
1.5270
0.3107
1.5239
1.5219
0.3950
1.9612
GLSP-2
0.3422
1.7750
1.7749
0.4011
1.9662
1.9674
0.4041
2.0384
GLSP-3
0.3270
1.6390
1.6202
0.3808
1.9044
1.8789
0.4191
2.0354
GLSP
0.3527
1.8474
1.8539
0.4105
2.0205
1.9909
0.4219
2.0555
Image query
NDCG@100
ACG@100
MAPw @500
1.9044
0.4032
2.0174
1.9778
2.0676
0.4294
2.1012
2.0573
1.9329
0.4327
2.1203
2.0818
2.0729
0.4346
2.1219
2.0848
1.9243
0.3960
1.9204
1.9051
2.0287
0.4245
2.0606
2.0444
1.9899
0.4444
2.1185
2.0837
2.0307
0.4467
2.1309
2.0897
AN US
Text query
128 bits MAPw @500
CR IP T
Type
Table 6: The effect of different component on IAPRTC-12 dataset from 16 bits to 128 bits. IAPRTC-12
Type
Method
16 bits
32 bits
128 bits
ACG@100
MAPw @500
NDCG@100
ACG@100
MAPw @500
NDCG@100
ACG@100
MAPw @500
NDCG@100
ACG@100
MAPw @500
GLSP-1
0.2110
0.9766
0.9496
0.2378
1.0610
1.0256
0.2598
1.1504
1.1049
0.2735
1.1834
1.1455
GLSP-2
0.2218
1.0126
0.9721
0.2580
1.1187
1.0760
0.3046
1.2703
1.2131
0.3223
1.3277
1.2655
GLSP-3
0.2023
0.9075
0.8827
0.2554
1.0966
1.0523
0.2810
1.1701
1.1160
0.3214
1.3093
1.2460
GLSP
0.2363
1.0983
1.0534
0.2699
1.1644
1.1180
0.3115
1.2953
1.2359
0.3304
1.3555
1.2906
GLSP-1
0.1910
0.8854
0.8703
GLSP-2
0.2049
0.9252
0.9058
GLSP-3
0.1901
0.8633
0.8371
GLSP
0.2104
0.9345
0.9173
0.2235
1.0392
1.0049
0.2533
1.1182
1.0810
0.2719
1.1742
1.1304
0.2311
1.0247
0.9919
0.2809
1.1770
1.1381
0.3081
1.2793
1.2284
0.2164
0.9670
0.9373
0.2390
1.0264
0.9890
0.2967
1.2199
1.1659
0.2435
1.0611
1.0251
0.2901
1.2311
1.1804
0.3134
1.2901
1.2376
ED
Text query
M
Image query
445
64 bits
NDCG@100
the local semantic structure preserving (metric learning) (GLSP-1) contributes
PT
the highest retrieval performance followed by the consistent regularization constraint (GLSP-3) and the global semantic structure preserving (GLSP-2). The reason can be attributed to the local semantic structure preserving can provide
CE
a more accurate similar or dissimilar discrimination for two data points from
450
different modalities. Thus it can ensure that similar data points rank before dis-
AC
similar data points as much as possible. Finally, the best retrieval performance can be achieved when all the three terms are combined together. 5.4. Parameter Sensitivity Analysis There are five parameters in the proposed method, including α, β, γ, µ and τ .
455
The parameters γ and µ are fixed for all experiments. Therefore, we will only
27
ACCEPTED MANUSCRIPT
analyze the effect of the rest parameters (α, β and τ ) on the experimental performance in this subsection. More specifically, we conduct a series of experiments in which we change one parameter while fixing the other parameters, and report
460
CR IP T
the results in Fig. 2 and Fig. 3. From the results, we have the following observations. (1) In most cases, the proposed method cannot obtain the best results when α is set to large values; (2) In most cases, we cannot obtain the best results
when β is set to small values; (3) The results are usually not satisfactory when the margin parameter τ is set to too large or too small values. α and β control
the significance of each modality for preserving the global multilevel similarity. A large α may break the balance of all components. In addition, a small β may
AN US
465
not well preserve the multilevel similarity in the hash codes of text modality. τ controls the margin between similar hash codes and dissimilar hash codes across different modalities. A small τ cannot well preserve the discriminative ability of hash codes. A large τ will introduce large amount of hard pairwise examples,
10 -4
10
1
10 2
10 -4
10 -2
10 4
1
10 2
MIRFlickr-25K @ 64 bits
10 -4
10
1
10 2
1
10 2
α
β
NDCGt2i @ 100
0.2 10 -4
0.4 0.3 0.2
10 -4
10 -2 1 10 2
10 4
1
7
1
10 2
MIRFlickr-25K @ 64 bits
0.2
10 -4
10 -2
1
β
10 4
1
5
3
τ
7
10
7
1 10 2
1
1
MIRFlickr-25K @ 64 bits
10 4
1
5
3
7
10 -4 1 10 2
β
10 4
1
5
3
τ
7
10
10 4
1
5
3
7
1 10 2
β
10 4
1
5
3
7
10
wMAPt2i @ 500
1 10 2
1 10 4
1
5
3
7
10 4
1
5
3
τ
7
10
10 2
1 10 -2 1 10 2 10 4
α
τ
1
5
3
7
10
τ
MIRFlickr-25K @ 64 bits
2.5
2 1.5 1
2 1.5 1 10 -4
10 -2 1 10 2
β
10 4
1
5
3
τ
7
10
10 -2 1 10 2
β
10 4
1
5
3
7
τ
Figure 2: The experimental performances for the variations in the value of hyper-parameters on the MIRFlickr-25K dataset. The hash code length is fixed at 64 bits.
28
10 4
β
2
10
10 -4 1
1
1.5
MIRFlickr-25K @ 64 bits
10 -2
10 -2
10 -4 10 -2
α
1
10 -4
MIRFlickr-25K @ 64 bits
1
τ
1.5
10 4
α
2.5
β
τ
10 -2
β
2
10 2
2
10 2
1
10 4
1.5
10
10 -4 10 -2
10 2
2.5
MIRFlickr-25K @ 64 bits
1
10
1
10 -4 1
2.5
1.5
10 -4
-2
MIRFlickr-25K @ 64 bits
10 -2
α
2
10 4
α
β
1
τ
10 -4 10 -2
1 10 2
2
10 2
2 1.5
10 -4 10 -2
10 4
2.5
MIRFlickr-25K @ 64 bits
0.2
10
10 2
1.5
10
2.5
0.3
10 -4
1
10 -4 10 -2
α
1
MIRFlickr-25K @ 64 bits
2
τ
0.4
10 4
α
β
-2
2.5
10 2
2 1.5
10 -4 10 -2
10 4
1.5
10
ACGi2t @ 100
0.3
1
5
3
0.5
NDCGt2i @ 100
0.4
10 2
10 4
α
τ
0.5
10
10 2
10 -4
10 -2
10
CE
α
5
3
10 -4
1
2.5
PT
0.3
10 4
-2
MIRFlickr-25K @ 64 bits
0.5
0.4
1 10 -4
10 -2
10 4
MIRFlickr-25K @ 64 bits
0.5
NDCGi2t @ 100
10 4
α
β
-2
2 1.5
2.5
wMAPt2i @ 500
10 4
α
-2
ACGt2i @ 100
10 2
1
wMAPi2t @ 500
1
2
1.5
MIRFlickr-25K @ 64 bits
2.5
wMAPt2i @ 500
0.2 10 -4
10 -2
NDCGi2t @ 100
0.3
ED
10 -4
0.4
ACGt2i @ 100
0.2
MIRFlickr-25K @ 64 bits
2.5
ACGt2i @ 100
0.3
MIRFlickr-25K @ 64 bits
2.5
ACGi2t @ 100
NDCGt2i @ 100
NDCGi2t @ 100
0.4
AC
MIRFlickr-25K @ 64 bits
0.5
wMAPi2t @ 500
MIRFlickr-25K @ 64 bits
0.5
wMAPi2t @ 500
MIRFlickr-25K @ 64 bits
M
which make the model hard to train.
ACGi2t @ 100
470
10
ACCEPTED MANUSCRIPT
1
1 10
β
IAPRTC-12 @ 64 bits
10 -2
1
10 4
1 10 2
0.2 -4
0.3 0.25 0.2
10
2
10 4
α
1
5
3
7
1 10
2
10 4
α
τ
IAPRTC-12 @ 64 bits
5
3
1
7
0.2 10 -4 1 10 2
β
10 4
1
5
3
7
10
1 10 2
1.2 1.1
10 2 10 4
β
τ
5
3
1
7
10
4
1
5
3
1.1 1
1.2 1.1 1 10 -4
10 -2 1 10 2 10
α
τ
4
1
5
3
7
10
1.3
1.2 1.1 1
1.2 1.1
10 -4
10 2
β
τ
10
4
1
5
3
1
10 2
β
τ
10
4
1
wMAPt2i @ 500
1 10 2
1
5
3
7
10 -4
10 -2
1
10 2
1.3 1.2
10
4
1
5
3
7
1.1
1
10 -2
10
1
10 2
10 4
α
τ
1
5
3
7
IAPRTC-12 @ 64 bits
1.4
1
10 2
β
τ
10
4
1
5
3
τ
1.3 1.2 1.1
1
10 -4 7
10
10 -2 1 10 2
β
10 4
1
5
3
7
τ
on the IAPRTC-12 dataset. The hash code length is fixed at 64 bits.
M
In this subsection, we present the value of the objective function with varying iterations on two datasets. The convergence curves on MIRFlickr-25K
ED
and IAPRTC-12 datasets at 64 bits are shown in Fig. 4 (a) and Fig. 4 (b) 475
respectively. From these figures, we can see that the value of the objective function decreases steadily with more iterations, which validates the effectiveness of
PT
Algorithm 1.
×10 7
MIRFlickr @ 64 bits
14
12
13
11
12
10
11
Objective function value
Objective function value
AC
CE
13
9 8 7 6 5 4 3
×10 7
IAPR TC-12 @ 64 bits
10 9 8 7 6 5
0
100
200
300
400
4
500
The number of iterations
0
100
200
300
400
500
The number of iterations
(a)
(b)
Figure 4: Convergence curves on MIRFlickr-25K and IAPRTC-12 datasets at 64 bits.
29
10
τ
Figure 3: The experimental performances for the variations in the value of hyper-parameters
5.5. Convergence Study
10 4
β
IAPRTC-12 @ 64 bits
10 -2
10
10 4
α
10 -4
10 -2
10
10 -2
β
IAPRTC-12 @ 64 bits
1.3
7
1
10 -4
α
τ
1.3
1
1.1
10 4
1.4
10 2
1.4
1
1
1
IAPRTC-12 @ 64 bits
1.1
10 -2
10 -2
1.4
1.2
10
-4
10 2
IAPRTC-12 @ 64 bits
1.2
10
10 4
α
β
1.4
10 -2
10
10 2
10 -4 7
1.2
10 -4 1
1.3
10 -4 1
1
10 -2
10 4
IAPRTC-12 @ 64 bits
IAPRTC-12 @ 64 bits
10 -2
10
-4
10 2
1.4
α
0.2
10 4
α
β
1
10 -4 10 -2
1
10 4
10 -2
1.3
τ
0.3
1 10 -4
1.4
10 2
0.25
1.1
1.3
10
ACGi2t @ 100
NDCGt2i @ 100
0.3 0.25
10 -2
10 -2
0.35
1.2
1.4
IAPRTC-12 @ 64 bits
0.35
10
-4
10 2
10 -4
10 -2
10
1 10 -2
1
10 -4 1
1.1
IAPR TC-12 @ 64 bits
ACGi2t
0.3 0.25
10 4
α
β
IAPRTC-12 @ 64 bits
NDCGt2i @ 100
NDCGi2t @ 100
10 -4
10 2
0.35
10 -2
NDCGi2t @ 100
10 4
α
0.35
10
2
1.2
10 -4 10 -2
wMAPt2i @ 500
10 -2
1
CR IP T
10 -4
10 4
1.1
ACGt2i @ 100
10 4
α
10 2
1.2
10 -4
10 -2
ACGt2i @ 100
10
2
1.3
wMAPi2t @ 500
1
1.4
1.3
wMAPt2i @ 500
0.2 10 -4
10 -2
IAPRTC-12 @ 64 bits
1.4
1.3
wMAPi2t @ 500
0.3 0.25
IAPRTC-12 @ 64 bits
1.4
1.3
wMAPi2t @ 500
-4
IAPRTC-12 @ 64 bits
1.4
AN US
0.2
ACGi2t @ 100
NDCGt2i @ 100
NDCGi2t @ 100
0.3 0.25
10
IAPRTC-12 @ 64 bits
IAPRTC-12 @ 64 bits
0.35
ACGt2i @ 100
IAPRTC-12 @ 64 bits
0.35
10
ACCEPTED MANUSCRIPT
Table 7: The comparison of training time and test time (in seconds) on MIRFlickr-25K dataset from 16 bits to 128 bits. MIRFlickr-25K 16 bits
32 bits
64 bits
128 bits
Training
Test time (s)
Training
Test time (s)
Training
CR IP T
Method Test time (s)
time (s)
Image
Text
time (s)
Image
Text
time (s)
Image
LSRH [33]
3.45e1
7.60e-3
3.80e-3
5.73e1
1.06e-2
5.70e-3
1.13e2
SCM [26]
1.53e4
7.90e-3
2.70e-3
2.36e4
5.70e-3
2.40e-3
Training
Test time (s)
Text
time (s)
Image
Text
1.89e-2
1.08e-2
2.24e2
3.50e-2
1.97e-2
4.56e4
1.03e-2
5.30e-3
9.15e4
1.59e-2
6.80e-3
1.94e2
7.86e-2
3.70e-2
2.00e2
6.42e-2
3.10e-2
3.22e2
6.21e-2
3.09e-2
4.89e2
6.30e-2
3.00e-2
3.86e1
6.63e-2
2.49e-2
4.02e1
5.29e-2
1.65e-2
6.00e1
5.72e-2
2.51e-2
9.67e1
5.89e-2
2.61e-2
DCMH [35]
5.16e3
1.56e0
5.34e-2
5.33e3
1.64e0
4.52e-2
5.53e3
1.68e0
3.99e-2
6.53e3
1.51e0
5.26e-2
PRDH [36]
6.00e3
1.54e0
4.95e-2
6.34e3
1.56e0
5.81e-2
6.77e3
1.69e0
4.74e-2
8.33e3
1.40e0
4.71e-2
GLSP
9.29e3
1.53e0
5.36e-2
9.88e3
1.57e0
4.32e-2
1.06e4
1.44e0
5.41e-2
1.24e4
1.63e0
6.53e-2
AN US
SePH [30] RoPH [31]
5.6. Comparison of Training and Test Time
The training and test time for all baselines are listed in Tables 7 and 8. For the training time, the traditional cross-modal hashing methods including LSRH,
M
480
SePH and RoPH are more efficient than deep cross-modal hashing methods including DCMH, PRDH, and GLSP. The proposed method GLSP is slightly
ED
slower than DCMH and PRDH. For the test time, the time cost of CNN feature extraction is not considered for the traditional cross-modal hashing methods 485
including LSRH, SCM, SePH, and RoPH. If the time of extracting feature is
PT
considered in these non-deep cross-modal hashing methods, then the deep crossmodal hashing methods can have competitive efficiency with them. Because
CE
DCMH, PRDH, and GLSP share the same network architecture, their test times are closed. Therefore, in general, the proposed method GLSP is effective and relatively efficient for the large-scale retrieval tasks.
AC
490
6. Conclusion To improve the discriminative ability while preserving the multilevel seman-
tic structure of hash codes, we propose a global and local semantics-preserving
30
ACCEPTED MANUSCRIPT
Table 8: The comparison of training time and test time (in seconds) on IAPRTC-12 dataset from 16 bits to 128 bits. IAPRTC-12 16 bits
32 bits
64 bits
128 bits
Training
Test time (s)
Training
Test time (s)
Training
CR IP T
Method Test time (s)
time (s)
Image
Text
time (s)
Image
Text
time (s)
Image
LSRH [33]
3.63e1
7.40e-3
7.90e-3
7.20e1
1.06e-2
1.11e-2
1.47e2
SCM [26]
1.10e4
3.70e-3
3.80e-3
2.36e4
5.60e-3
5.70e-3
Training
Test time (s)
Text
time (s)
Image
Text
1.82e-2
2.00e-2
2.87e2
3.73e-2
3.67e-2
4.62e4
8.00e-3
9.20e-3
8.99e4
1.52e-2
1.71e-2
1.60e2
5.79e-2
6.33e-2
2.19e2
5.65e-2
5.79e-2
2.77e2
5.99e-2
6.46e-2
4.84e2
5.86e-2
6.31e-2
3.14e1
5.54e-2
6.28e-2
4.17e1
5.68e-2
6.17e-2
6.30e1
6.05e-2
6.50e-2
9.41e1
5.67e-2
6.12e-2
DCMH [35]
5.42e3
1.61e0
1.59e-1
5.51e3
1.43e0
1.26e-1
5.51e3
1.70e0
1.43e-1
6.87e3
1.54e0
2.03e-1
PRDH [36]
6.38e3
1.55e0
1.43e-1
6.29e3
1.60e0
1.25e-1
6.47e3
1.40e0
1.26e-1
8.46e3
1.67e0
1.74e-1
GLSP
1.07e4
1.62e0
1.53e-1
1.12e4
1.65e0
1.63e-1
1.18e4
1.57e0
1.64e-1
1.31e4
1.37e0
1.30e-1
AN US
SePH [30] RoPH [31]
based deep hashing method for cross-modal retrieval. More specifically, we in495
troduce a metric learning method for improving the discriminative ability of hash codes from different modalities. Meanwhile the local semantic structure is
M
preserved into the hash codes. In addition, a multilevel semantic affinity matrix is constructed to learn the global semantic structure preserving hash codes for
500
ED
each modality. Subsequently, in order to further enhance the performance of cross-modal retrieval, a consistent regularization term is introduced to guarantee that the resulting hash codes are as consistent as possible across different
PT
modalities. Finally, they are integrated into an end-to-end learning framework. The entire network model is trained via an efficient optimization algorithm.
CE
The experimental results carried out on two cross-modal benchmark datasets 505
demonstrate the effectiveness of the proposed method. In order to learn consistent hash codes across different modalities, two simple
AC
consistent square constraints are introduced. Some other complex technologies such as collective quantization, collective matrix factorization and so on can be used to further improve the correlations across different modalities. In addition,
510
different network models can be also exploited in the future work.
31
ACCEPTED MANUSCRIPT
Acknowledgment This work was supported in part by National Natural Science Foundation of
References 515
References
CR IP T
China under Grant 61525102, Grant 61502084, and Grant 61601102.
[1] L. Ma, H. Li, F. Meng, Q. Wu, K. N. Ngan, Learning efficient binary codes
from high-level feature representations for multi-label image retrieval, IEEE
AN US
Transactions on Multimedia PP (99) (2017) 1–1.
[2] A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: International Conference on Very Large Data Bases, 1999, pp.
520
518–529.
M
[3] Y. Gong, S. Lazebnik, A. Gordo, F. Perronnin, Iterative quantization: A procrustean approach to learning binary codes for large-scale image re-
525
ED
trieval, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2916–2929. [4] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Advances in Neural Information Processing Systems, 2008, pp. 1753–1760.
PT
[5] W. Liu, J. Wang, S. Kumar, S. Chang, Hashing with graphs, in: International Conference on Machine Learning, 2011, pp. 1–8.
CE
[6] F. Shen, C. Shen, W. Liu, H. T. Shen, Supervised discrete hashing, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp.
530
AC
37–45.
[7] J. Wang, O. Kumar, S. Chang, Semi-supervised hashing for scalable image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3424–3431.
32
ACCEPTED MANUSCRIPT
535
[8] T. Song, J. Cai, T. Zhang, C. Gao, F. Meng, Q. Wu, Semi-supervised manifold-embedded hashing with joint feature representation and classifier learning, Pattern Recognition 68 (2017) 99–110.
CR IP T
[9] W. Liu, J. Wang, R. Ji, Y. Jiang, S. Chang, Supervised hashing with kernels, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2074–2081.
540
[10] W. Li, S. Wang, W. Kang, Feature learning based deep supervised hashing with pairwise labels, in: International Joint Conference on Artificial
AN US
Intelligence, 2016, pp. 1711–1717.
[11] S. Chen, F. Shen, Y. Yang, X. Xu, J. Song, Supervised hashing with adaptive discrete optimization for multimedia retrieval, Neurocomputing 253
545
(2017) 97–103.
[12] Y. Xu, F. Shen, X. Xu, L. Gao, Y. Wang, X. Tan, Large-scale image
M
retrieval with supervised sparse hashing, Neurocomputing 229 (2017) 45– 53.
[13] C. Deng, X. Liu, Y. Mu, J. Li, Large-scale multi-task image labeling with
ED
550
adaptive relevance discovery and feature hashing, Signal Processing 112
PT
(2015) 137–145.
[14] C. Deng, H. Deng, X. Liu, Y. Yuan, Adaptive multi-bit quantization for
CE
hashing, Neurocomputing 151 (2015) 319–326. 555
[15] X. Liu, C. Deng, Y. Mu, Z. Li, Boosting complementary hash tables for fast
AC
nearest neighbor search, in: Proceedings of AAAI Conference on Artificial Intelligence, 2017.
[16] D. Zhang, F. Wang, L. Si, Composite hashing with multiple information
560
sources, in: ACM SIGIR Conference on Research and Development in Information Retrieval, 2011, pp. 225–234.
33
ACCEPTED MANUSCRIPT
[17] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, Y. Zhuang, Sparse multi-modal hashing, IEEE Transactions on Multimedia 16 (2) (2014) 427–439. [18] J. Zhou, G. Ding, Y. Guo, Latent semantic sparse hashing for cross-modal
ment in Information Retrieval, 2014, pp. 415–424.
565
CR IP T
similarity search, in: ACM SIGIR Conference on Research and Develop-
[19] G. Ding, Y. Guo, J. Zhou, Collective matrix factorization hashing for multi-
modal data, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2083–2090.
AN US
[20] G. Irie, H. Arai, Y. Taniguchi, Alternating co-quantization for cross-modal hashing, in: IEEE International Conference on Computer Vision, 2015, pp.
570
1886–1894.
[21] D. Wang, X. Gao, X. Wang, L. He, Semantic topic multimodal hashing for cross-media retrieval, in: Proceedings of the International Joint Conference
575
M
on Artificial Intelligence, 2015, pp. 3890–3896.
[22] L. Xie, L. Zhu, G. Chen, Unsupervised multi-graph cross-modal hashing
ED
for large-scale multimedia retrieval, Multimedia Tools Appl. 75 (15) (2016) 9185–9204.
PT
[23] M. Long, Y. Cao, J. Wang, P. S. Yu, Composite correlation quantization for efficient multimodal retrieval, in: Proceedings of International ACM SIGIR conference on Research and Development in Information Retrieval,
580
CE
2016, pp. 579–588.
AC
[24] L. Zhu, Z. Huang, X. Liu, X. He, J. Sun, X. Zhou, Discrete multimodal
585
hashing with canonical views for robust mobile landmark search, IEEE Transactions on Multimedia 19 (9) (2017) 2066–2079.
[25] X. Liu, J. He, C. Deng, B. Lang, Collaborative hashing, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2147–2154.
34
ACCEPTED MANUSCRIPT
[26] D. Zhang, W. Li, Large-scale supervised multimodal hashing with semantic correlation maximization, in: Proceedings of AAAI Conference on Artificial Intelligence, 2014, pp. 2177–2183. [27] B. Wu, Q. Yang, W. Zheng, Y. Wang, J. Wang, Quantized correlation
CR IP T
590
hashing for fast cross-modal search, in: Proceedings of International Joint Conference on Artificial Intelligence, 2015, pp. 3946–3952.
[28] H. Liu, R. Ji, Y. Wu, G. Hua, Supervised matrix factorization for crossmodality hashing, in: Proceedings of International Joint Conference on Artificial Intelligence, 2016, pp. 1767–1773.
AN US
595
[29] D. Wang, X. Gao, X. Wang, L. He, B. Yuan, Multimodal discriminative binary embedding for large-scale cross-modal retrieval, IEEE Transactions on Image Processing 25 (10) (2016) 4540–4554.
[30] Z. Lin, G. Ding, J. Han, J. Wang, Cross-view retrieval via probability-based
(2017) 1–14.
M
semantics-preserving hashing, IEEE Transactions on Cybernetics PP (99)
600
ED
[31] K. Ding, B. Fan, C. Huo, S. Xiang, C. Pan, Cross-modal hashing via rankorder preserving, IEEE Transactions on Multimedia 19 (3) (2017) 571–585.
PT
[32] X. Xu, F. Shen, Y. Yang, H. T. Shen, X. Li, Learning discriminative binary codes for large-scale cross-modal retrieval, IEEE Transactions on Image
605
CE
Processing 26 (5) (2017) 2494–2507. [33] K. Li, G. Qi, J. Ye, K. A. Hua, Linear subspace ranking hashing for cross-
AC
modal retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 39 (9) (2017)
610
1825–1838.
[34] L. Liu, Z. Lin, L. Shao, F. Shen, G. Ding, J. Han, Sequential discrete hashing for scalable cross-modality similarity retrieval, IEEE Trans. Image Processing 26 (1) (2017) 107–118.
35
ACCEPTED MANUSCRIPT
[35] Q. Jiang, W. Li, Deep cross-modal hashing, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3270–3278. 615
[36] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, X. Gao, Pairwise relationship
CR IP T
guided deep hashing for cross-modal retrieval, in: Proceedings of AAAI Conference on Artificial Intelligence, 2017, pp. 1618–1625.
[37] Y. Cao, M. Long, J. Wang, Q. Yang, P. S. Yu, Deep visual-semantic hashing for cross-modal retrieval, in: Proceedings of ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2016, pp. 1445–
620
AN US
1454.
[38] Y. Cao, M. Long, J. Wang, S. Liu, Collective deep quantization for efficient cross-modal retrieval, in: Proceedings of AAAI Conference on Artificial Intelligence, 2017, pp. 3974–3980. 625
[39] Y. Cao, M. Long, J. Wang, Correlation hashing network for efficient cross-
M
modal retrieval, CoRR abs/1602.06697.
[40] L. Liu, F. Shen, Y. Shen, X. Liu, L. Shao, Deep sketch hashing: Fast
ED
free-hand sketch-based image retrieval, CoRR abs/1703.05605. [41] C. Deng, Z. Chen, X. Liu, X. Gao, D. Tao, Triplet-based deep hashing network for cross-modal retrieval, IEEE Transactions on Image Processing
630
PT
27 (8) (2018) 3893–3903.
[42] F. Zhao, Y. Huang, L. Wang, T. Tan, Deep semantic ranking based hashing
CE
for multi-label image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1556–1564.
[43] K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey on
AC
635
cross-modal retrieval, CoRR abs/1607.06215.
[44] N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, N. Vasconcelos, A new approach to cross-modal multimedia retrieval, in: Proceedings of the International Conference on Multimedia,
640
2010, pp. 251–260. 36
ACCEPTED MANUSCRIPT
[45] C. Deng, X. Tang, J. Yan, W. Liu, X. Gao, Discriminative dictionary learning with common label alignment for cross-modal retrieval, IEEE Trans. Multimedia 18 (2) (2016) 208–218.
CR IP T
[46] V. E. Liong, J. Lu, Y. Tan, J. Zhou, Deep coupled metric learning for cross-modal matching, IEEE Trans. Multimedia 19 (6) (2017) 1234–1244.
645
[47] H. J´egou, M. Douze, C. Schmid, Product quantization for nearest neighbor search, IEEE Trans. Pattern Anal. Mach. Intell. 33 (1) (2011) 117–128.
[48] A. Babenko, V. S. Lempitsky, Additive quantization for extreme vector
nition, 2014, pp. 931–938.
650
AN US
compression, in: IEEE Conference on Computer Vision and Pattern Recog-
[49] T. Zhang, C. Du, J. Wang, Composite quantization for approximate nearest neighbor search, in: Proceedings of the International Conference on Machine Learning, 2014, pp. 838–846.
M
[50] K. He, F. Wen, J. Sun, K-means hashing: An affinity-preserving quantization method for learning binary compact codes, in: IEEE Conference on
655
ED
Computer Vision and Pattern Recognition, 2013, pp. 2938–2945. [51] X. Liu, Z. Li, C. Deng, D. Tao, Distributed adaptive binary quantization for fast nearest neighbor search, IEEE Trans. Image Processing 26 (11)
660
PT
(2017) 5324–5336.
[52] X. Liu, B. Du, C. Deng, M. Liu, B. Lang, Structure sensitive hashing with
CE
adaptive product quantization, IEEE Trans. Cybernetics 46 (10) (2016) 2252–2264.
AC
[53] T. Zhang, J. Wang, Collaborative quantization for cross-modal similarity
665
search, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2036–2045.
[54] E. Yang, C. Deng, C. Li, W. Liu, J. Li, D. Tao, Shared predictive crossmodal deep quantization, IEEE Transactions on Neural Networks and Learning Systems PP (99) (2018) 1–12. 37
ACCEPTED MANUSCRIPT
[55] B. Dai, R. Guo, S. Kumar, N. He, L. Song, Stochastic generative hashing, in: Proceedings of the International Conference on Machine Learning, 2017,
670
pp. 913–922.
CR IP T
[56] X. Liu, L. Huang, C. Deng, B. Lang, D. Tao, Query-adaptive hash code
ranking for large-scale multi-view visual search, IEEE Trans. Image Processing 25 (10) (2016) 4514–4524. 675
[57] X. Liu, L. Huang, C. Deng, J. Lu, B. Lang, Multi-view complementary hash tables for nearest neighbor search, in: IEEE International Conference
AN US
on Computer Vision, ICCV, 2015, pp. 1107–1115.
[58] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, in: British Machine Vision Conference, 2014.
680
[59] J. Lu, J. Hu, Y. P. Tan, Discriminative deep metric learning for face and
4269–4282.
M
kinship verification, IEEE Transactions on Image Processing 26 (9) (2017)
ED
[60] K. Sohn, Improved deep metric learning with multi-class n-pair loss objective, in: Advances in Neural Information Processing Systems, 2016, pp.
685
1849–1857.
PT
[61] J. Lu, G. Wang, W. Deng, P. Moulin, J. Zhou, Multi-manifold deep metric learning for image set classification, in: IEEE Conference on Computer
CE
Vision and Pattern Recognition, 2015, pp. 1137–1145. 690
[62] M. J. Huiskes, M. S. Lew, The MIR flickr retrieval evaluation, in: Proceed-
AC
ings of ACM SIGMM International Conference on Multimedia Information Retrieval, 2008, pp. 39–43.
[63] H. J. Escalante, C. A. Hern´ andez, J. A. Gonz´ alez, A. L´ opez-L´ opez,
695
M. Montes-y-G´ omez, E. F. Morales, L. E. Sucar, L. V. Pineda, M. Grubinger, The segmented and annotated IAPR TC-12 benchmark, Computer Vision and Image Understanding 114 (4) (2010) 419–428. 38
ACCEPTED MANUSCRIPT
[64] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, F. Li, Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.
AC
CE
PT
ED
M
AN US
CR IP T
700
39
ACCEPTED MANUSCRIPT 1
Lei Ma received the B.Sc. degree in communication engineering from Hubei University in 2012, and he is currently working toward the Ph.D degree in signal and information processing in the Intelligent Visual Information Processing and Communication Laboratory (IVIPC) at University of Electronic Science and Technology of China (UESTC). His research interests include large scale multimedia indexing and
AC
CE
PT
ED
M
AN US
CR IP T
retrieval, computer vision, pattern recognition and machine learning.
May 8, 2017
DRAFT
ACCEPTED MANUSCRIPT 2
Hongliang Li (SM’12) received his Ph.D. degree in Electronics and Information Engineering from Xian Jiaotong University, China, in 2005. From 2005 to 2006, he joined the visual signal processing and communication laboratory (VSPC) of the Chinese University of Hong Kong (CUHK) as a Research Associate. From 2006 to 2008, he was a Postdoctoral Fellow at the same laboratory in CUHK. He is currently a Professor in the School of Electronic Engineering, University of Electronic Science and Technology of China. His research interests include image segmentation, object detection, image and video coding, visual attention, and multimedia communication system.
CR IP T
Dr. Li has authored or co-authored numerous technical articles in well-known international journals and conferences. He is a co-editor of a Springer book titled Video segmentation and its applications. Dr. Li was involved in many professional activities. He is a member of the Editorial Board of the Journal on Visual Communications and Image Representation, and the Area Editor of Signal Processing: Image Communication, Elsevier Science. He served as a Technical Program Co-chairs for VCIP2016 and ISPACS 2009, General co-chair of the ISPACS 2010, Publicity co-chair of IEEE VCIP 2013 local chair of the IEEE ICME 2014, and TPC members in a number of international conferences, e.g., ICME 2013, ICME 2012, ISCAS 2013, PCM 2007,
AC
CE
PT
ED
M
AN US
PCM 2009, and VCIP 2010. He is now a senior member of IEEE.
May 8, 2017
DRAFT
ACCEPTED MANUSCRIPT 3
Fanman Meng (S’12-M’14) received the Ph.D. degree in signal and information processing from University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2014. From July 2013 to July 2014, he joined Division of Visual and Interactive Computing of Nanyang Technological University in Singapore as a Research Assistant. He is currently an Associate professor in the School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China. His research interests include image segmentation and object detection. Dr. Meng has authored or co-authored numerous technical articles in well-known international journals and conferences. He
CR IP T
received the ”Best student paper honorable mention award” for the 12th Asian Conference on Computer Vision (ACCV 2014) in Singapore and the ”Top 10% paper award” in the IEEE International Conference on Image Processing (ICIP 2014) at Paris,
AC
CE
PT
ED
M
AN US
France. He is now a member of IEEE and IEEE CAS society.
May 8, 2017
DRAFT
ACCEPTED MANUSCRIPT 4
Qingbo Wu (S’12-M’13) received the B.E. degree in Education of Applied Electronic Technology from Hebei Normal University in 2009, and the Ph.D. degree in signal and information processing in University of Electronic Science and Technology of China in 2015. From February 2014 to May 2014, he was a Research Assistant with the Image and Video Processing (IVP) Laboratory at Chinese University of Hong Kong. Then, from October 2014 to October 2015, he served as a visiting scholar with the Image & Vision Computing (IVC) Laboratory at University of Waterloo. He is currently a lecturer in the School of Electronic Engineering, University of Electronic Science and Technology of China. His research interests include image/video
AC
CE
PT
ED
M
AN US
CR IP T
coding, quality evaluation, and perceptual modeling and processing.
May 8, 2017
DRAFT
ACCEPTED MANUSCRIPT 5
King N. Ngan (F’00) received the Ph.D. degree in Electrical Engineering from the Loughborough University in U.K. He is currently a chair professor at the Department of Electronic Engineering, Chinese University of Hong Kong. He was previously a full professor at the Nanyang Technological University, Singapore, and the University of Western Australia, Australia. He has been appointed Chair Professor 705
at the University of Electronic Science and Technology, Chengdu, China, under the National Thousand Talents Program since 2012. He holds honorary and visiting professorships of numerous universities in
China, Australia and South East Asia.
CR IP T
Prof. Ngan served as associate editor of IEEE Transactions on Circuits and Systems for Video Technology, Journal on Visual Communications and Image Representation, EURASIP Journal of Signal Processing: Image Communication, and Journal of Applied Signal Processing. He chaired and co-chaired a number of prestigious international conferences on image and video processing including the 2010 IEEE International Conference on Image Processing, and served on the advisory and technical committees of numerous professional organizations. He has published extensively including 3 authored books, 7 edited volumes, over 400 refereed technical papers, and edited 9 special issues in journals. In addition, he holds 15 patents in the areas of image/video coding and communications.
AN US
Prof. Ngan is a Fellow of IEEE (U.S.A.), IET (U.K.), and IEAust (Australia), and an IEEE Distinguished Lecturer in 2006-
AC
CE
PT
ED
M
2007.
May 8, 2017
DRAFT