Deep semantic cross modal hashing with correlation alignment

Meijia Zhang a,c, Junzheng Li b, Huaxiang Zhang a,c,∗, Li Liu a,c,∗

a School of Information Science and Engineering, Shandong Normal University, Jinan 250014, Shandong, China
b Informatization Office, Shandong Management University, Jinan 250357, Shandong, China
c Institute of Data Science and Technology, Shandong Normal University, Jinan 250014, Shandong, China
Article history: Received 18 September 2019; Revised 29 October 2019; Accepted 17 November 2019. Communicated by Yongdong Zhang.

Keywords: Deep neural network; Cross modal hashing; Inter-modal similarity; Correlation alignment; Semantic embedding
Abstract: Hashing has been extensively applied to cross modal retrieval because of its low storage cost and high efficiency. Deep hashing, which extracts features of multi-modal data well, has recently received increasing research attention. However, most deep hashing methods for cross modal retrieval neither make full use of the semantic label information nor fully mine the correlation of heterogeneous data. In this paper, we propose a Deep Semantic cross modal hashing with Correlation Alignment (DSCA) method. In DSCA, we design two deep neural networks for the image and text modalities separately and learn two hash functions. Firstly, we construct a new similarity for multi-label data, which exploits the semantic information well and improves the retrieval accuracy. Simultaneously, we preserve the inter-modal similarity of heterogeneous data features, which exploits the semantic correlation. Secondly, the distributions of heterogeneous data are aligned so as to mine the inter-modal correlation well. Thirdly, the semantic label information is embedded in the hash layer of the text network, which makes the learned hash matrix more stable and the hash codes more discriminative. Experimental results demonstrate that DSCA outperforms the state-of-the-art methods.
∗ Corresponding authors at: School of Information Science and Engineering, Shandong Normal University, Jinan 250014, Shandong, China. E-mail addresses: [email protected] (H. Zhang), [email protected] (L. Liu).

1. Introduction

With the rapid advancement of information technology such as the Internet, social media such as images, texts and videos have grown explosively, shifting the description of things from a single modality [1–7] to multiple modalities. How to effectively carry out semantic correlation analysis and similarity measurement on such multi-modal data, and thus realize retrieval across modalities, has become a new research hotspot in the field of artificial intelligence.

Considering their low storage cost and high computational efficiency, hashing methods [8–20,22,23,34–36] have been introduced to the cross modal retrieval field; they embed samples of different modalities into a common Hamming space. Inter-Media Hashing (IMH) [8] explored the inter-modality and intra-modality correlations by projecting heterogeneous data into a common Hamming space. Semantic Topic Multi-modal Hashing (STMH) [19] considered latent semantic information in the coding procedure: it first discovers clustering patterns of texts and robustly factorizes the image matrix to obtain
multiple semantic topics of texts and concepts of images. However, these methods apply shallow models to project high-dimensional features into a Hamming space, and thus cannot capture the correlation between heterogeneous data well. Recently, deep learning [24–26] has been widely adopted to mine the correlation of heterogeneous data features, and consequently many deep hashing methods for cross modal retrieval [27,28] have emerged. Deep Cross-Modal Hashing (DCMH) [27] was a representative work that integrated feature learning and hash code learning into the same framework; it preserves the inter-modal similarity nicely but lacks consideration of the semantic label information. The Deep Semantic Multimodal Hashing Network for scalable multimedia retrieval (DSMHN) [28] not only preserved the inter-modal similarity but also exploited intra-modal semantic labels; however, it exploited semantic information through label prediction, which is not stable. In addition, the inter-modal correlation of heterogeneous data is crucial to cross modal retrieval and needs further study.

To better exploit the semantic information and the correlation of heterogeneous data, and thus improve the retrieval accuracy, we propose an end-to-end method, Deep Semantic cross modal hashing with Correlation Alignment (DSCA). In DSCA, the two deep neural networks for the image and text modalities both contain semantic layers and hash layers, so two hash functions are learned. To learn efficient hash functions, firstly, we construct a new similarity for multi-label data, which better exploits the
semantic information than DCMH and DSMHN. Simultaneously, the inter-modal similarity of heterogeneous data features is preserved, which exploits the semantic correlation. Secondly, the distributions of heterogeneous data are aligned to mine the inter-modal correlation well. Thirdly, the semantic label information is embedded in the hash layer of the text network, which makes the learned hash matrix more stable and the hash codes more discriminative than the label prediction in DSMHN. In addition, a bit-balance constraint is imposed on the method. The main contributions of DSCA are summarized as follows:

• We construct a new similarity for multi-label data, which exploits the semantic information well and improves the retrieval accuracy.
• The inter-modal similarity of heterogeneous data features is preserved, which exploits the semantic correlation.
• The distributions of heterogeneous data are aligned to mine the inter-modal correlation well.
• The semantic label information is embedded in the hash layer of the text network, which makes the learned hash matrix more stable and the hash codes more discriminative.
The rest of this paper is organized as follows. Section 2 reviews related work on cross modal retrieval. Our proposed method and learning algorithm are presented in Section 3. Experimental results and the corresponding analysis are given in Section 4. Finally, our work is concluded in Section 5.

2. Related work

According to whether semantic label information is used during training, cross modal hashing methods can be divided into unsupervised methods [8,10,12,13,29] and supervised methods [14–17,19,20,27,30–33,37–39].

The unsupervised methods project pairwise instances from the original feature spaces into a common Hamming space. Song et al. [8] proposed the Inter-Media Hashing (IMH) method, which preserves intra-modal and inter-modal similarity with unlabeled training data. Ding et al. [10] proposed Collective Matrix Factorization Hashing (CMFH), in which the heterogeneous data are projected into a shared latent semantic space by collective matrix factorization, and unified hash codes are then learned in this space. Fusion Similarity Hashing (FSH) [12] embedded the fusion similarity, built on a graph model of the heterogeneous data, into the common Hamming space learning. Latent Semantic Sparse Hashing (LSSH) [13] learned latent semantic spaces for each modality through sparse coding.

The supervised hashing methods, which use semantic label information, can make hash codes more discriminative than the unsupervised ones. Semantic Correlation Maximization (SCM) [14] extended canonical correlation analysis with semantic classes. Semantics-Preserving Hashing (SePH) [15] learned hash codes by minimizing the Kullback-Leibler (KL) divergence between the binary codes of heterogeneous data in Hamming space. Discriminant Cross-modal Hashing (DCH) [16] learned discriminative hash codes in a unified framework. Kumar et al. [17] introduced spectral hashing into cross modal retrieval and proposed the Cross View Hashing (CVH) algorithm. Semantic Topic Multimodal Hashing (STMH) [19] considered latent semantic information in the coding procedure. Fast Discrete Cross-Modal Hashing (FDCH) [20] presented a fast cross-modal discrete hashing scheme based on semantic label regression, which effectively improved the accuracy and efficiency of cross-modal retrieval. Liu et al. [21] proposed Subspace Relation Learning for Cross-modal Hashing (SRLCH), which exploits the correlation information in semantic labels and preserves the nonlinear structure.

The cross modal retrieval methods mentioned above are all shallow methods, which cannot capture the correlation between
heterogeneous data well. Recently, many deep methods for cross modal retrieval [24–28] have emerged, but they do not fully exploit the semantic information and the correlation of heterogeneous data.

3. The proposed approach

An overview of our proposed method is shown in Fig. 1. There are two networks, for the image and text modalities respectively. For a paired image and text, the raw pixels of the image are input into the image network to learn hash codes, while the Bag-of-Words (BoW) features of the text are input into the text network to learn hash codes. The semantic label information is then embedded in the text network. By training the networks with back propagation and mini-batch stochastic gradient descent, efficient hash functions are learned.

3.1. Notation

In this paper, we consider the case where the image and text modalities have the same number of samples. Let $X = \{x_i\}_{i=1}^{n} \in \mathbb{R}^{d_x \times n}$ denote the image modality and $Y = \{y_i\}_{i=1}^{n} \in \mathbb{R}^{d_y \times n}$ denote the text modality, where $d_x$ and $d_y$ are the feature dimensions of the image and text modalities respectively and $n$ is the number of samples. $L = \{l_i\}_{i=1}^{n} \in \mathbb{R}^{c \times n}$ denotes the label matrix, where $c$ is the number of semantic classes; $l_{ji} = 1$ if $\{x_i, y_i\}$ belongs to the $j$th semantic class, and $l_{ji} = 0$ otherwise. $D = \{x_i, y_i, l_i\}_{i=1}^{n}$ denotes the set of data pairs with labels $l_i$. The heterogeneous data $x_i$ and $y_j$ are associated with a similarity matrix $S$ with elements $s_{ij}$. Traditionally, $s_{ij} = 1$ if $x_i$ and $y_j$ share a common semantic class, and $s_{ij} = 0$ otherwise. We denote this traditional similarity matrix by $S^1$ with elements $s_{ij}^1$. For multi-label datasets, in order to obtain a more accurate similarity matrix, we define the elements of a similarity matrix $S^2$, inspired by the Jaccard distance, as follows:
$$s_{ij}^2 = \begin{cases} \dfrac{N_{l_i,l_j}}{\max(N_{l_i}, N_{l_j})}, & N_{l_i,l_j} \geq 1 \\[4pt] 0, & N_{l_i,l_j} = 0 \end{cases} \tag{1}$$
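To make Eq. (1) concrete, the following is a minimal NumPy sketch (ours, not the paper's code; the function name and the toy labels are illustrative):

```python
import numpy as np

def multilabel_similarity(L):
    """Jaccard-inspired similarity S2 of Eq. (1).

    L: (c, n) multi-hot label matrix, L[j, i] = 1 iff sample i
       belongs to class j.
    Returns S2 with S2[i, j] = N_{li,lj} / max(N_li, N_lj).
    """
    inter = L.T @ L                        # N_{li,lj}: shared labels per pair
    counts = L.sum(axis=0)                 # N_li: number of labels of sample i
    denom = np.maximum.outer(counts, counts)
    return np.where(inter >= 1, inter / np.maximum(denom, 1), 0.0)

# toy example: 3 classes, 2 samples sharing one of their two labels
L = np.array([[1, 1],
              [1, 0],
              [0, 1]], dtype=float)
print(multilabel_similarity(L))            # off-diagonal entries equal 1/2
```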
where $l_i$ ($l_j$) is the label vector of the $i$th ($j$th) data point, $l_i, l_j \in \{0, 1\}^{c \times 1}$; $N_{l_i}$ is the number of 1-elements in $l_i$, $N_{l_j}$ is the number of 1-elements in $l_j$, and $N_{l_i,l_j}$ is the number of positions where $l_i$ and $l_j$ equal 1 simultaneously.

3.2. Hash code learning

In the image modality network, let $f(x_i; \theta_x)$ denote the output of layer fc8; in the text modality network, $g(y_j; \theta_y)$ denotes the output of layer fc2, where $\theta_x$ and $\theta_y$ are the parameters of the two networks respectively. In order to mine the semantic correlation, the inter-modal similarity is preserved during learning, and the probability that the features of two samples are similar is defined as:
$$p(s_{ij} \mid f(x_i; \theta_x), g(y_j; \theta_y)) = \begin{cases} \sigma(\phi_{ij}), & s_{ij} = 1 \\ 1 - \sigma(\phi_{ij}), & s_{ij} = 0 \end{cases} \tag{2}$$

where $s_{ij}$ stands for $s_{ij}^1$ or $s_{ij}^2$, and $\sigma(\phi_{ij}) = \frac{1}{1 + e^{-\phi_{ij}}}$ is the sigmoid function. $\phi_{ij} = \frac{1}{2} F_{*i}^{T} G_{*j}$ is the inner product of the two samples' features, whose value is proportional to the probability that the samples are similar. $F_{*i} = f(x_i; \theta_x)$ is the $i$th column of $F \in \mathbb{R}^{k \times n}$, and $G_{*j} = g(y_j; \theta_y)$ is the $j$th column of $G \in \mathbb{R}^{k \times n}$, where $k$ is the length of the hash codes.
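A small NumPy sketch of Eq. (2) under the definitions above (F and G are the k × n network output matrices; the function names are ours):

```python
import numpy as np

def pairwise_logits(F, G):
    """phi_ij = 0.5 * F_{*i}^T G_{*j} for all pairs; returns an (n, n) matrix."""
    return 0.5 * F.T @ G

def similar_probability(F, G):
    """sigma(phi_ij) of Eq. (2): probability that pair (i, j) is similar."""
    phi = pairwise_logits(F, G)
    return 1.0 / (1.0 + np.exp(-phi))
```

With these in hand, the likelihood of Eq. (2) is $\sigma(\phi_{ij})$ for similar pairs and $1 - \sigma(\phi_{ij})$ for dissimilar ones.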
Fig. 1. The overview of our proposed method DSCA.
Firstly, we employ a cross-entropy loss to preserve the inter-modal similarity of the hash codes from the image and text modality networks:

$$Loss = -\sum_{i,j=1}^{n} \left( s_{ij} \log(\sigma(\phi_{ij})) + (1 - s_{ij}) \log(1 - \sigma(\phi_{ij})) \right) = -\sum_{i,j=1}^{n} \left( s_{ij} \phi_{ij} - \log(1 + e^{\phi_{ij}}) \right) \tag{3}$$
According to Eq. (3), the pairwise loss is formulated as follows:

$$J_1 = -\sum_{i,j=1}^{n} \left( s_{ij}^1 \phi_{ij} - \log(1 + e^{\phi_{ij}}) \right) - \mu \sum_{i,j=1}^{n} \left( s_{ij}^2 \phi_{ij} - \log(1 + e^{\phi_{ij}}) \right) \tag{4}$$

where $\mu$ is a hyper-parameter, $s_{ij}^1$ is the traditional similarity and $s_{ij}^2$ is the multi-label similarity.

Secondly, in order to mine the inter-modal correlation well, the distributions of the heterogeneous data are aligned as:

$$J_2 = \frac{1}{2} \left\| C^x - C^y \right\|_F^2 \tag{5}$$

where $C^x$ and $C^y$ denote the covariance matrices of the image and text modalities respectively, defined as:

$$C^x = \frac{1}{n-1} \left( F F^T - \frac{1}{n} (F\mathbf{1})(F\mathbf{1})^T \right), \quad C^y = \frac{1}{n-1} \left( G G^T - \frac{1}{n} (G\mathbf{1})(G\mathbf{1})^T \right) \tag{6}$$

where $\mathbf{1} \in \mathbb{R}^n$ is a vector with all elements equal to 1.

Although the inter-modal similarity can be preserved by Eq. (4), the semantic label information is not fully exploited. In [28], the learned hash codes are used for label prediction in the image and text modalities respectively to exploit semantic information, but label prediction is not stable. Thirdly, we therefore project the semantic labels into the text modality network to exploit the semantic information, which is more stable than label prediction:

$$J_3 = \left\| G - PL \right\|_*^2 + \frac{\lambda}{\beta} \left\| P \right\|_*^2 \tag{7}$$

where $L$ is the label matrix, $P \in \mathbb{R}^{k \times c}$ is the projection matrix, $\lambda$ and $\beta$ are two hyper-parameters, and $\|\cdot\|_*$ denotes the trace norm.

Furthermore, to make the outputs of the two networks close to the same hash codes $B$ for paired images and texts, we have:

$$J_4 = \left\| B - F \right\|_F^2 + \left\| B - G \right\|_F^2 \tag{8}$$

Finally, a bit-balance constraint is added to balance the numbers of "+1" and "−1", i.e., each bit of the hash code has an equal chance of being "+1" or "−1", so that the semantic information is evenly distributed over the Hamming dimensions [40]:

$$J_5 = \left\| F\mathbf{1} \right\|_F^2 + \left\| G\mathbf{1} \right\|_F^2 \tag{9}$$

Based on the above analysis, the objective function of DSCA is written as:

$$\min_{B, \theta_x, \theta_y, P} J = J_1 + \alpha J_2 + \beta J_3 + \gamma J_4 + \eta J_5 \quad \text{s.t.} \quad B \in \{-1, +1\}^{k \times n} \tag{10}$$

where $\alpha$, $\beta$, $\gamma$, $\eta$ are four hyper-parameters. The effect of each term in the objective function can be summarized as follows: $J_1$ preserves the semantic correlation, $J_2$ guarantees the correlation alignment, $J_3$ embeds the semantic labels so that discriminative hash codes can be learned, $J_4$ preserves the similarity of the hash codes, and $J_5$ balances the hash codes.
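As a sanity check on the loss terms, the following NumPy sketch evaluates $J_1$ of Eq. (4) and the correlation-alignment term $J_2$ of Eqs. (5) and (6); it assumes the column-wise $F, G \in \mathbb{R}^{k \times n}$ defined above and uses the covariance form as reconstructed in Eq. (6), so treat the exact normalization as an assumption:

```python
import numpy as np

def J1(S1, S2, F, G, mu):
    """Pairwise negative log-likelihood of Eq. (4)."""
    phi = 0.5 * F.T @ G
    nll = lambda S: -np.sum(S * phi - np.logaddexp(0.0, phi))  # stable log(1+e^phi)
    return nll(S1) + mu * nll(S2)

def covariance(F):
    """Covariance of Eq. (6); columns of the (k, n) matrix F are samples."""
    n = F.shape[1]
    m = F.sum(axis=1, keepdims=True)           # F @ 1, shape (k, 1)
    return (F @ F.T - (m @ m.T) / n) / (n - 1)

def J2(F, G):
    """Correlation alignment of Eq. (5): 0.5 * ||C^x - C^y||_F^2."""
    D = covariance(F) - covariance(G)
    return 0.5 * np.sum(D * D)
```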
3.3. Training

There are four parameters to be optimized in DSCA, so we adopt an alternating learning strategy, updating one parameter while fixing the others. Algorithm 1 describes the whole learning procedure of DSCA.
Algorithm 1. The learning algorithm for the proposed DSCA.

Input: images X, texts Y, label matrix L.
Output: parameters θx and θy of the image and text networks, projection matrix P, hash code matrix B.
Initialize: network parameters θx and θy; hyper-parameters α, β, γ, η, λ, μ; mini-batch size Nx = Ny = 128; iteration numbers tx = [n/Nx], ty = [n/Ny].
Repeat
  1: for iter = 1, 2, ..., tx do
  2:   Randomly sample Nx images from X to construct a mini-batch;
  3:   For each sampled point xi in the mini-batch, calculate the derivative according to Eq. (11);
  4:   Update the parameter θx with back propagation;
  5: end for
  6: for iter = 1, 2, ..., ty do
  7:   Randomly sample Ny texts from Y to construct a mini-batch;
  8:   For each sampled point yj in the mini-batch, calculate the derivative according to Eq. (12);
  9:   Update the parameter θy with back propagation;
  10: end for
  11: Update B according to Eq. (14);
  12: Update P according to Eq. (17);
until a fixed number of iterations is reached.
3.3.1. Update θx

With θy, B and P fixed, the parameter θx of the image network is learned through the gradient

$$\frac{\partial J}{\partial F_{*i}} = \frac{1}{2} \sum_{j=1}^{n} \left( \sigma(\phi_{ij}) - s_{ij}^1 \right) G_{*j} + \frac{\mu}{2} \sum_{j=1}^{n} \left( \sigma(\phi_{ij}) - s_{ij}^2 \right) G_{*j} + \frac{2\alpha}{n-1} \left( F_{*i} - \frac{1}{n} F\mathbf{1} \right) + 2\gamma \left( F_{*i} - B_{*i} \right) + 2\eta F\mathbf{1} \tag{11}$$
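A NumPy sketch of Eq. (11) for one column of F, under the coefficient reading adopted above (this is our reconstruction, so treat the exact constants as an assumption):

```python
import numpy as np

def grad_F_i(i, F, G, B, S1, S2, mu, alpha, gamma, eta):
    """Gradient of J with respect to F[:, i], following Eq. (11)."""
    n = F.shape[1]
    phi_i = 0.5 * F[:, i] @ G                  # phi_ij for all j, shape (n,)
    sig = 1.0 / (1.0 + np.exp(-phi_i))
    g = 0.5 * G @ (sig - S1[i]) + 0.5 * mu * G @ (sig - S2[i])  # likelihood terms
    g += 2.0 * alpha / (n - 1) * (F[:, i] - F.sum(axis=1) / n)  # alignment term
    g += 2.0 * gamma * (F[:, i] - B[:, i])                      # quantization term
    g += 2.0 * eta * F.sum(axis=1)                              # bit-balance term
    return g
```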
3.3.2. Update θy

With θx, B and P fixed, the parameter θy of the text network is learned through the gradient

$$\frac{\partial J}{\partial G_{*j}} = \frac{1}{2} \sum_{i=1}^{n} \left( \sigma(\phi_{ij}) - s_{ij}^1 \right) F_{*i} + \frac{\mu}{2} \sum_{i=1}^{n} \left( \sigma(\phi_{ij}) - s_{ij}^2 \right) F_{*i} - \frac{2\alpha}{n-1} \left( G_{*j} - \frac{1}{n} G\mathbf{1} \right) + 2\beta \left( G_{*j} - PL_{*j} \right) + 2\gamma \left( G_{*j} - B_{*j} \right) + 2\eta G\mathbf{1} \tag{12}$$
3.3.3. Update B

With θx, θy and P fixed, the problem in Eq. (10) can be transformed, using trace operations, into:

$$\max_{B} \; \gamma \left( \mathrm{Tr}(B^T F) + \mathrm{Tr}(B^T G) \right) = \mathrm{Tr}(B^T V) \quad \text{s.t.} \quad B \in \{-1, +1\}^{k \times n} \tag{13}$$

where $V = \gamma (F + G)$. The hash codes can then be updated by:

$$B = \mathrm{sign}(V) = \mathrm{sign}(\gamma (F + G)) \tag{14}$$
3.3.4. Update P

Since the trace norm is defined as $\|U\|_* = \mathrm{tr}(U U^T)^{\frac{1}{2}}$, $J_3$ in Eq. (7) can be written, using trace operations, as:

$$J_3 = \mathrm{tr}\left( (G - PL)(G - PL)^T \right) + \frac{\lambda}{\beta} \mathrm{tr}(P P^T) \tag{15}$$

With θx, θy and B fixed, the projection matrix P can be learned by matrix differentiation:

$$\frac{1}{2} \frac{\partial J}{\partial P} = \beta (G - PL)(-L^T) + \lambda P \tag{16}$$

Setting Eq. (16) to zero, we obtain:

$$P = \beta G L^T \left( \beta L L^T + \lambda I \right)^{-1} \tag{17}$$

where $I \in \mathbb{R}^{c \times c}$ is an identity matrix.
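Both discrete updates have simple closed forms; here is a NumPy sketch of Eqs. (14) and (17) (solving a linear system rather than forming the inverse is our choice, not the paper's):

```python
import numpy as np

def update_B(F, G, gamma):
    """Eq. (14): B = sign(gamma * (F + G))."""
    return np.sign(gamma * (F + G))

def update_P(G, L, beta, lam):
    """Eq. (17): P = beta * G L^T (beta * L L^T + lam * I)^{-1}."""
    c = L.shape[0]
    A = beta * L @ L.T + lam * np.eye(c)       # (c, c), symmetric
    M = beta * G @ L.T                         # (k, c)
    return np.linalg.solve(A, M.T).T           # solves P A = M
```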
3.4. Out of sample

For a query that is not in the training set, we generate the hash codes of an image query $x_q$ with the image network:

$$b_q^x = h^x(x_q) = \mathrm{sign}(f(x_q; \theta_x)) \tag{18}$$

In the same way, we generate the hash codes of a text query $y_q$ with the text network:

$$b_q^y = h^y(y_q) = \mathrm{sign}(g(y_q; \theta_y)) \tag{19}$$
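In practice, retrieval then reduces to encoding the query with Eq. (18) or (19) and ranking the database codes by Hamming distance. A short sketch follows; f_forward stands for the trained network's forward pass and is a hypothetical handle, not an API from the paper:

```python
import numpy as np

def encode_query(f_forward, q, theta):
    """Eqs. (18)/(19): binary code of an unseen query."""
    return np.sign(f_forward(q, theta))        # f_forward is a placeholder

def hamming_rank(b_q, B_db):
    """Rank (k, n) database codes by Hamming distance to a (k,) query code.

    For codes in {-1, +1}, d_H(b_q, b) = (k - b_q^T b) / 2.
    """
    k = B_db.shape[0]
    dist = (k - b_q @ B_db) / 2.0
    return np.argsort(dist)                    # database indices, nearest first
```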
4. Experiments

Our proposed DSCA is implemented with the open-source MatConvNet framework on an NVIDIA GTX 1080 Ti GPU. Experiments on three multi-label datasets verify its effectiveness.
4.1. Datasets description

To fully evaluate the effectiveness of our method, we test on three public multi-label datasets: MIRFLICKR-25K, NUS-WIDE-10 and NUS-WIDE-21. They are introduced in detail as follows.

The MIRFLICKR-25K dataset contains 25,000 images with a total of 24 labels. Each image is associated with one or more labels and is connected with several textual tags. We select the 20,015 image-text pairs that have at least 20 textual tags. Of these, 2000 pairs are randomly selected as the query set and the rest form the database; within the database, 5000 pairs are selected for training. For deep methods, the image and text features are the raw pixels and 1386-dimensional Bag-of-Words (BoW) features respectively. For shallow methods, the image and text features are 512-dimensional Scale-Invariant Feature Transform (SIFT) features and 1386-dimensional BoW features respectively.

The NUS-WIDE dataset contains approximately 270,000 images with a total of 81 labels. In the experiments, we select the 186,577 image-text pairs that belong to at least one of the 10 most frequent concepts and denote this subset NUS-WIDE-10. In NUS-WIDE-10, we randomly select 1867 pairs as the query set and the rest form the database; within the database, 5000 pairs are selected for training. We also select the 195,834 image-text pairs that belong to at least one of 21 concepts, denoted NUS-WIDE-21. In NUS-WIDE-21, 1958 pairs are randomly selected as the query set and the rest constitute the database; within the database, 5000 pairs are selected for training. For both NUS-WIDE-10 and NUS-WIDE-21, the deep methods use the raw pixels and 1000-dimensional BoW features as image and text features respectively, while the shallow methods use 500-dimensional Bag-of-Visual-Words (BoVW) features and 1000-dimensional BoW features respectively.
Table 1. The mAP scores on the MIRFLICKR-25K dataset (numbers in boldface are the best).

Task            Method     16 bits   32 bits   64 bits   128 bits
Image to Text   CCQ        0.5615    0.5599    0.5595    0.5586
                CMFH       0.5820    0.5788    0.5583    0.5575
                DCH        0.5900    0.5898    0.6099    0.6230
                FSH        0.5617    0.5621    0.5568    0.5568
                LSSH       0.5728    0.5742    0.5749    0.5752
                SCM-Orth   0.5780    0.5716    0.5670    0.5647
                SCM-Seq    0.6159    0.6257    0.6322    0.6346
                SePH       0.6575    0.6668    0.6729    0.6743
                SRLCH      0.6043    0.6290    0.6146    0.6275
                DCMH       0.7013    0.7078    0.7156    0.7147
                DSCA       0.7165    0.7286    0.7335    0.7402
Text to Image   CCQ        0.5608    0.5596    0.5594    0.5599
                CMFH       0.5768    0.5761    0.5642    0.5637
                DCH        0.5924    0.5908    0.6043    0.6187
                FSH        0.5618    0.5633    0.5571    0.5571
                LSSH       0.5858    0.5869    0.5884    0.5865
                SCM-Orth   0.5824    0.5752    0.5684    0.5648
                SCM-Seq    0.6105    0.6168    0.6202    0.6225
                SePH       0.7028    0.7162    0.7197    0.7238
                SRLCH      0.6045    0.6349    0.6194    0.6335
                DCMH       0.7385    0.7443    0.7464    0.7467
                DSCA       0.7621    0.7700    0.7750    0.7803

Table 2. The mAP scores on the NUS-WIDE-10 dataset (numbers in boldface are the best).

Task            Method     16 bits   32 bits   64 bits   128 bits
Image to Text   CCQ        0.3503    0.3463    0.3463    0.3473
                CMFH       0.3441    0.3518    0.3459    0.3532
                DCH        0.4342    0.4534    0.4929    0.5042
                FSH        0.3536    0.3539    0.3566    0.3637
                LSSH       0.3929    0.3930    0.3973    0.3969
                SCM-Orth   0.3843    0.3722    0.3642    0.3588
                SCM-Seq    0.4796    0.4869    0.4865    0.4907
                SePH       0.5565    0.5668    0.5729    0.5732
                SRLCH      0.3772    0.3478    0.3516    0.3655
                DCMH       0.5961    0.6124    0.6399    0.6565
                DSCA       0.6090    0.6261    0.6442    0.6572
Text to Image   CCQ        0.3489    0.3472    0.3467    0.3482
                CMFH       0.3427    0.3529    0.3448    0.3564
                DCH        0.4239    0.4425    0.4787    0.4869
                FSH        0.3528    0.3546    0.3556    0.3634
                LSSH       0.4186    0.4190    0.4233    0.4193
                SCM-Orth   0.3792    0.3686    0.3618    0.3586
                SCM-Seq    0.4591    0.4667    0.4676    0.4711
                SePH       0.6454    0.6609    0.6691    0.6761
                SRLCH      0.3562    0.3622    0.3654    0.3637
                DCMH       0.6277    0.6475    0.6680    0.6806
                DSCA       0.6545    0.6760    0.6944    0.6924
4.2. Comparison methods and evaluation metrics

We compare DSCA with ten state-of-the-art cross-modal hashing methods, including the supervised methods CCQ [9], SCM-Orth [14], SCM-Seq [14], SePH [15], DCH [16], SRLCH [21] and DCMH [27], and the unsupervised methods CMFH [10], FSH [12] and LSSH [13]. These baselines are introduced in detail in Section 2. Their source codes are available online, and their parameters are set according to the recommendations of the corresponding papers. Since the deep methods take raw pixels as image input while the shallow methods require extracted features, for a fair comparison the image features of the shallow methods are extracted from the fc7 layer of the image network.

In this work, we evaluate DSCA on two retrieval tasks: images retrieving texts (I2T) and texts retrieving images (T2I). We use the mean Average Precision (mAP), the top-N precision and the Precision-Recall (PR) curve to evaluate the final performance of DSCA.

4.3. Experimental results and analysis

As shown in Fig. 1, there are two deep neural networks, for the image and text modalities. Because network models are not our concern, the networks in this paper are the same as in DCMH [27]. The image network has eight layers: the first seven layers are used for feature learning and the last is the hash layer. The text network has three layers: the first two fully-connected layers are used for feature learning and the last is the hash layer. In our experiments we set α = 1, β = 1, λ = 0.1, γ = 1, η = 1 and μ = 1 for the MIRFLICKR-25K and NUS-WIDE-10 datasets; for the NUS-WIDE-21 dataset, we set α = 1, β = 1, λ = 0.1, γ = 1, η = 1 and μ = 0.5.

Tables 1–3 show the mAP scores of DSCA and the compared methods on the three datasets with hash code lengths from 16 to 128 bits. From the tables we can see that deep learning based methods achieve better performance than the methods applying shallow models.
Table 3. The mAP scores on the NUS-WIDE-21 dataset (numbers in boldface are the best).

Task            Method     16 bits   32 bits   64 bits   128 bits
Image to Text   CCQ        0.3193    0.3164    0.3158    0.3152
                CMFH       0.3032    0.3253    0.3161    0.3135
                DCH        0.4090    0.4097    0.4371    0.4501
                FSH        0.3219    0.3242    0.3280    0.3346
                LSSH       0.3582    0.3578    0.3640    0.3614
                SCM-Orth   0.3485    0.3373    0.3297    0.3247
                SCM-Seq    0.4303    0.4400    0.4415    0.4463
                SePH       0.5183    0.5303    0.5350    0.5382
                SRLCH      0.3448    0.3278    0.3413    0.3419
                DCMH       0.5576    0.5705    0.5980    0.6077
                DSCA       0.5729    0.5857    0.5982    0.6088
Text to Image   CCQ        0.3189    0.3166    0.3162    0.3161
                CMFH       0.3043    0.3250    0.3174    0.3141
                DCH        0.3924    0.3967    0.4189    0.4331
                FSH        0.3226    0.3236    0.3280    0.3318
                LSSH       0.3960    0.3951    0.3974    0.3973
                SCM-Orth   0.3455    0.3356    0.3294    0.3271
                SCM-Seq    0.4104    0.4164    0.4195    0.4221
                SePH       0.6072    0.6273    0.6315    0.6379
                SRLCH      0.3270    0.3269    0.3291    0.3299
                DCMH       0.6003    0.6102    0.6275    0.6361
                DSCA       0.6335    0.6435    0.6609    0.6644
Moreover, our proposed DSCA achieves the best performance on all three datasets, and DCMH obtains lower performance than our method because we adopt the multi-label similarity, the correlation alignment and the semantic embedding at the same time, which effectively improves the retrieval accuracy. We can also see that the T2I task performs better than the I2T task for most of the compared methods, which means the texts encode more discriminative information than the images. Furthermore, the improvement on the MIRFLICKR-25K dataset is more significant than that on the NUS-WIDE datasets. This is because MIRFLICKR-25K is more complex than NUS-WIDE: each image contains more objects and has more labels. Consequently, on more complex datasets the effectiveness of our method is more pronounced.

Figs. 2 and 4 show the precision-recall curves for the two retrieval tasks of the different methods on the three datasets with 32-bit and 64-bit hash codes respectively. From the figures, we can see that our method achieves the best performance.
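For reference, a minimal sketch of the mAP protocol, assuming the common convention that a retrieved item is relevant if it shares at least one label with the query (the paper does not spell this out):

```python
import numpy as np

def average_precision(rel_ranked):
    """AP of one query given 0/1 relevance in ranked order."""
    rel = np.asarray(rel_ranked, dtype=float)
    if rel.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rel) / np.arange(1, rel.size + 1)
    return float((prec_at_k * rel).sum() / rel.sum())

def mean_average_precision(dist, relevance):
    """dist, relevance: (n_query, n_db) Hamming distances and 0/1 relevance."""
    order = np.argsort(dist, axis=1)           # rank database items per query
    return float(np.mean([
        average_precision(relevance[q, order[q]]) for q in range(dist.shape[0])
    ]))
```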
Fig. 2. Precision-recall curves for two retrieval tasks of different methods on three datasets (32-bits hash code).
Fig. 3. top-N Precision curves for two retrieval tasks of different methods on three datasets (32-bits hash code).
Fig. 4. Precision-recall curves for two retrieval tasks of different methods on three datasets (64-bits hash code).
Fig. 5. top-N Precision curves for two retrieval tasks of different methods on three datasets (64-bits hash code).
Fig. 6. Friedman test on three datasets.
Figs. 3 and 5 show the top-N precision curves for the two retrieval tasks of the different methods on the three datasets with 32-bit and 64-bit hash codes respectively. To compute the top-N precision, we set N to 1000, sampled every 50 results. From the figures, we can see that our method achieves the best top-N precision.

To verify the significance of the results of our proposed method, we conduct the Friedman test and the Wilcoxon test with the Statistical Product and Service Solutions (SPSS) software on the three datasets, as shown in Figs. 6 and 7. Fig. 6(a) shows that our proposed DSCA and the ten baselines exhibit asymptotic significance on the MIRFLICKR-25K dataset. On the NUS-WIDE-10 and NUS-WIDE-21 datasets, for simplicity, we only conduct the Friedman test between DSCA and the second best method, DCMH (see Tables 1–3); the results in Fig. 6(b) and (c) show that DSCA indeed achieves the best performance. Fig. 7 shows the Wilcoxon test results on the three datasets comparing DSCA with DCMH, which again confirm that DSCA achieves the best performance.
Table 4. Ablation study on the MIRFLICKR-25K dataset with 32-bit hash codes (numbers in boldface are the best).

Method     I2T       T2I       Average
DSCA       0.7286    0.7700    0.7493
DSCA-I     0.7247    0.7672    0.74595
DSCA-II    0.7266    0.7640    0.7453
DSCA-III   0.7267    0.7680    0.74735
DSCA-IV    0.7270    0.7681    0.74755
Further, four variants of DSCA are designed to evaluate the contribution of each component, as reported in Table 4: DSCA-I adopts only the traditional similarity, DSCA-II removes the correlation alignment, DSCA-III removes the semantic embedding, and DSCA-IV replaces the trace norm with the L2 norm. From Table 4 we can see that each term of DSCA contributes to the result; DSCA-II obtains the lowest average score, which indicates that the correlation alignment contributes the most.
Fig. 7. Wilcoxon test on three datasets.
5. Conclusion

In this paper, we propose an end-to-end method, Deep Semantic cross modal hashing with Correlation Alignment (DSCA). Building on DCMH, we first construct a new similarity to exploit the semantic information. We then adopt correlation alignment to mine the inter-modal correlation. Finally, the semantic label information is embedded in the hash layer of the text network to obtain discriminative hash codes. Experiments on three datasets show the superiority of our proposed method over the state-of-the-art methods.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The work is partially supported by the National Natural Science Foundation of China (Nos. 61772322, 61702310, U1836216),
the Major Fundamental Research Project of Shandong, China (No. ZR2019ZD03) and the Taishan Scholar Project of Shandong, China.

References

[1] M. Zhao, H. Zhang, L. Meng, An angle structure descriptor for image retrieval, China Commun. 13 (8) (2016) 222–230.
[2] W. Hong, M. Li, J. Geng, et al., Novel chaotic bat algorithm for forecasting complex motion of floating platforms, Appl. Math. Model. 72 (2019) 425–443.
[3] G. Fan, L. Peng, W. Hong, Short term load forecasting based on phase space reconstruction algorithm and bi-square kernel regression model, Appl. Energy 224 (2018) 13–33.
[4] Y. Dong, Z. Zhang, W. Hong, A hybrid seasonal mechanism with a chaotic cuckoo search algorithm with a support vector regression model for electric load forecasting, Energies 11 (4) (2018) 1009.
[5] C. Yan, Y. Tu, X. Wang, STAT: spatial-temporal attention mechanism for video captioning, IEEE Trans. Multimed. (2019), doi:10.1109/TMM.2019.2924576.
[6] C. Yan, L. Li, C. Zhang, et al., Cross-modality bridging and knowledge transferring for image understanding, IEEE Trans. Multimed. 21 (10) (2019) 2675–2685.
[7] C. Yan, H. Xie, J. Chen, et al., A fast uyghur text detector for complex background images, IEEE Trans. Multimed. 20 (12) (2018) 3389–3398.
[8] J. Song, Y. Yang, Y. Yang, et al., Inter-media hashing for large-scale retrieval from heterogeneous data sources, in: Proceedings of ACM SIGMOD, 2013, pp. 785–796.
[9] M. Long, Y. Cao, J. Wang, et al., Composite correlation quantization for efficient multimodal retrieval, in: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 579–588.
[10] G. Ding, Y. Guo, J. Zhou, Collective matrix factorization hashing for multimodal data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2075–2082.
[11] G. Ding, Y. Guo, J. Zhou, et al., Large-scale cross-modality search via collective matrix factorization hashing, IEEE Trans. Image Process. 25 (11) (2016) 5427–5440.
[12] H. Liu, R. Ji, Y. Wu, et al., Cross-modality binary code learning via fusion similarity hashing, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6345–6353.
[13] J. Zhou, G. Ding, Y. Guo, Latent semantic sparse hashing for cross-modal similarity search, in: Proceedings of ACM SIGIR, 2014, pp. 415–424.
[14] D. Zhang, W. Li, Large-scale supervised multimodal hashing with semantic correlation maximization, in: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 2177–2183.
[15] Z. Lin, G. Ding, J. Han, et al., Cross-view retrieval via probability-based semantics-preserving hashing, IEEE Trans. Cybern. 47 (12) (2017) 4342–4355.
[16] X. Xu, F. Shen, Y. Yang, et al., Learning discriminative binary codes for large-scale cross-modal retrieval, IEEE Trans. Image Process. 26 (5) (2017) 2494–2507.
[17] S. Kumar, R. Udupa, Learning hash functions for cross-view similarity search, in: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Catalonia, Spain, 2011, pp. 1360–1365.
[18] G. Irie, H. Arai, Y. Taniguchi, Alternating co-quantization for cross-modal hashing, in: Proceedings of the IEEE International Conference on Computer Vision, 2016, pp. 1886–1894.
[19] D. Wang, X. Gao, X. Wang, et al., Semantic topic multimodal hashing for cross-media retrieval, in: Proceedings of the International Conference on Artificial Intelligence, 2015, pp. 3890–3896.
[20] X. Liu, X. Nie, W. Zeng, et al., Fast discrete cross-modal hashing with regressing from semantic labels, in: Proceedings of the ACM Multimedia Conference, 2018, pp. 1662–1669.
[21] L. Liu, Y. Yang, M. Hu, et al., Index and retrieve multimedia data: cross-modal hashing by learning subspace relation, in: Proceedings of the International Conference on Database Systems for Advanced Applications, 2018, pp. 606–621.
[22] K. Wang, R. He, L. Wang, Joint feature selection and subspace learning for cross-modal retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 38 (10) (2016) 2010–2023.
[23] J. Tang, K. Wang, L. Shao, Supervised matrix factorization hashing for cross-modal retrieval, IEEE Trans. Image Process. 25 (7) (2016) 3157–3166.
[24] X. Huang, Y. Peng, Deep cross-media knowledge transfer, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8837–8846.
[25] Y. Peng, J. Qi, Y. Yuan, CM-GANs: cross-modal generative adversarial networks for common representation learning, IEEE Trans. Multimed. 15 (1) (2019) 22.
[26] N. Li, C. Li, C. Deng, et al., Deep joint semantic-embedding hashing, in: Proceedings of IJCAI, 2018, pp. 2397–2403.
[27] Q. Jiang, W. Li, Deep cross-modal hashing, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1618–1625.
[28] L. Jin, J. Tang, Z. Li, et al., Deep semantic multimodal hashing network for scalable multimedia retrieval, arXiv preprint arXiv:1901.02662, 2019.
[29] D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Comput. 16 (12) (2004) 2639–2664.
[30] N. Rasiwasia, J.C. Pereira, E. Coviello, et al., A new approach to cross-modal multimedia retrieval, in: Proceedings of the International Conference on Multimedia, 2010, pp. 251–260.
[31] K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey on cross-modal retrieval, arXiv preprint arXiv:1607.06215, 2016.
[32] Y. Wei, Y. Zhao, Z. Zhu, et al., Modality-dependent cross-media retrieval, ACM Trans. Intell. Syst. Technol. 7 (4) (2016) 1–13.
[33] J. Yan, H. Zhang, J. Sun, Q. Wang, et al., Joint graph regularization based modality-dependent cross-media retrieval, Multimed. Tools Appl. 77 (3) (2018) 3009–3027.
[34] L. Zhu, J. Shen, X. Liu, et al., Learning compact visual representation with canonical views for robust mobile landmark search, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2016, pp. 3959–3965.
[35] L. Zhu, Z. Huang, X. Liu, et al., Discrete multimodal hashing with canonical views for robust mobile landmark search, IEEE Trans. Multimed. 19 (9) (2017) 2066–2079.
[36] L. Zhu, Z. Huang, X. Chang, J. Song, H.T. Shen, Exploring consistent preferences: discrete hashing with pair-exemplar for scalable landmark search, in: Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 726–734.
[37] Y. Wei, Y. Zhao, C. Lu, et al., Cross-modal retrieval with CNN visual features: a new baseline, IEEE Trans. Cybern. 47 (2) (2016) 449–460.
[38] J. Tang, Z. Li, Weakly-supervised multimodal hashing for scalable social image retrieval, IEEE Trans. Circuits Syst. Video Technol. 28 (10) (2018) 2730–2741.
[39] H. Liu, R. Wang, S. Shan, X. Chen, Deep supervised hashing for fast image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2064–2072.
[40] Y. Weiss, A. Torralba, R. Fergus, Spectral hashing, in: Advances in Neural Information Processing Systems (NIPS), 2008, pp. 1753–1760.

Meijia Zhang received her M.S. degree from the School of Optoelectronic Engineering, Chongqing University of Posts and Telecommunications, China, in 2015. She is currently pursuing the Ph.D. degree in the School of Information Science & Engineering at Shandong Normal University. Her research interests include machine learning, pattern recognition, and cross-media retrieval.
Junzheng Li received his M.S. degree in computer software and theory from Shandong Normal University, China, in 2015. He is currently with the Informatization Office, Shandong Management University, Jinan. His research interests include machine learning, pattern recognition, cross-media retrieval, and Internet of Things technology.
Huaxiang Zhang is currently a professor with the School of Information Science and Engineering and the Institute of Data Science and Technology, Shandong Normal University, China. He received his Ph.D. from Shanghai Jiaotong University in 2004 and worked as an associate professor with the Department of Computer Science, Shandong Normal University, from 2004 to 2005. He has authored over 180 journal and conference papers and has been granted 20 invention patents. His current research interests include machine learning, pattern recognition, evolutionary computation, cross-media retrieval, and web information processing.

Li Liu received her Ph.D. degree from Shandong University, where she was involved in computer vision. She is currently an associate professor of information science and engineering with Shandong Normal University. Her research interests include image processing, computer vision and pattern recognition.