Neural Networks 125 (2020) 142–152
Fast discrete cross-modal hashing with semantic consistency
Tao Yao a,b,∗, Lianshan Yan b, Yilan Ma a, Hong Yu a, Qingtang Su a, Gang Wang a, Qi Tian c

a Department of Information and Electrical Engineering, Ludong University, Yantai, 264000, China
b Yantai Research Institute of New Generation Information Technology, Southwest Jiaotong University, 264000, China
c Huawei Noah's Ark Lab, 518129, China

∗ Corresponding author at: Department of Information and Electrical Engineering, Ludong University, Yantai, 264000, China. E-mail address: [email protected] (T. Yao).
https://doi.org/10.1016/j.neunet.2020.01.035
Article history: Received 28 March 2019; Received in revised form 17 December 2019; Accepted 28 January 2020; Available online 11 February 2020.

Keywords: Cross-modal retrieval; Semantic consistency; Discrete optimization; Hashing
Abstract

Supervised cross-modal hashing has attracted widespread attention for large-scale retrieval tasks due to its promising retrieval performance. However, most existing works suffer from some of the following issues. First, most of them only leverage the pair-wise similarity matrix to learn hash codes, which may result in a loss of class information. Second, the pair-wise similarity matrix generally leads to high computing complexity and memory cost. Third, most of them relax the discrete constraints during optimization, which generally results in a large cumulative quantization error and consequently inferior hash codes. To address the above problems, we present a Fast Discrete Cross-modal Hashing method in this paper, FDCH for short. Specifically, FDCH first leverages both class labels and the pair-wise similarity matrix to learn a shared Hamming space in which the semantic consistency can be better preserved. We then propose an asymmetric hash code learning model to avoid the challenging issue of symmetric matrix factorization. Finally, an effective and efficient discrete optimization scheme is designed to generate discrete hash codes directly, and the computing complexity and memory cost caused by the pair-wise similarity matrix are reduced from O(n²) to O(n), where n denotes the size of the training set. Extensive experiments conducted on three real world datasets highlight the superiority of FDCH over several cross-modal hashing methods and demonstrate its effectiveness and efficiency.
1. Introduction

In the last several decades, a tremendous amount of multimedia data has been generated on the Internet. These instances are generally represented by high-dimensional vectors (Li, Fan, Zhang, & Huang, 2019), which makes it impractical for traditional methods to perform nearest neighbor retrieval. Hashing methods, which map high-dimensional feature vectors into low-dimensional hash codes, have been recognized as an effective tool for large-scale applications due to their high efficiency in computation and storage (Pan, Yao, Li, Ngo, & Mei, 2015; Wu & Zou, 2014). Accordingly, many hashing works have been proposed to address the large-scale retrieval issue (Deng, Yang, Liu, Liu et al., 2019; Deng, Yang, Liu & Tao, 2019; Hu et al., 2019; Hu, Yang, Shen, Xie, & Shen, 2018; Lin, Shen & Av, 2015; Liu, Lin et al., 2017; Liu et al., 2019; Lu et al., 2019; Lu, Zhu, Cheng, Nie, & Zhang, 2019b; Luo, Yang et al., 2018; Shen, Shen, Liu, & Shen, 2015; Shi, Xing, Xu, Sapkota & Yang, 2017; Tang, Wang, & Shao, 2016; Weiss, Torralba, & Fergus, 2009; Yao, Kong, Fu, & Qi, 2017; Zhu, Huang, Li, Xie, & Shen, 2018). They generally aim at learning hash codes by minimizing the Hamming
distances of similar instances and maximizing the Hamming distances of dissimilar ones, so that the similarities among instances can be calculated efficiently in the Hamming space by XOR operations (see the sketch below). However, most of them are designed only for a single modality and cannot be applied to multi-modal instances directly (Hu et al., 2018; Lin, Shen et al., 2015; Luo, Yang et al., 2018; Shen et al., 2015; Shi, Xing et al., 2017; Weiss et al., 2009).

On the Internet, multimedia data are typically derived from different sources, e.g., one web page may contain texts, images and videos. Although these instances are represented in different modalities, they may share the same semantic concepts. Therefore, when users submit a query to a search engine, it is desirable to return similar instances from various modalities, which can provide complementary information about the query. However, the similarities among heterogeneous data cannot be measured directly due to the heterogeneity gap. The key challenge of cross-modal retrieval is thus to learn a shared space in which the similarities among heterogeneous data can be measured. To tackle this challenge, quite a few works have been designed for the cross-modal retrieval task (Peng, Huang, & Qi, 2016; Peng, Qi, & Huang, 2018; Rasiwasia et al., 2010; Wang, Yang, Xu, Hanjalic, & Shen, 2017). However, a major limitation of these methods is that they cannot handle large-scale multimedia retrieval efficiently.
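As a concrete illustration of the XOR-based Hamming distance computation mentioned above, the following is a minimal sketch (our own, not from the paper) for binary codes packed into integers:

```python
# Hamming distance between two binary codes packed into integers:
# XOR the codes, then count the set bits (popcount).
def hamming_distance(code_a: int, code_b: int) -> int:
    return bin(code_a ^ code_b).count("1")

# Example: two 8-bit codes that differ in exactly 3 bit positions.
assert hamming_distance(0b10110100, 0b00011100) == 3
```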
More recently, cross-modal hashing, which projects heterogeneous instances into a shared Hamming space, has attracted increasing attention due to its efficiency in both memory cost and computation. Consequently, several cross-modal hashing works have been devised in recent years (Bronstein, Bronstein, Michel, & Paragios, 2010; Chao et al., 2018; Deng, Chen, Liu, Gao, & Tao, 2018; Ding, Guo, & Zhou, 2014; Liu, Lin et al., 2017; Shen et al., 2017; Tang et al., 2016; Wang, Gao, Wang, & He, 2018; Wu, Yang, Zheng, Wang, & Wang, 2015; Yao et al., 2017; Yao, Kong, Fu & Tian, 2019; Yao, Zhang, Yan, Yue & Tian, 2019; Zhang & Li, 2014; Zhen & Yeung, 2012; Zheng et al., 2019). According to whether class labels are used in the training procedure, these works can be loosely classified into unsupervised and supervised methods. Unsupervised methods generally encode heterogeneous data into homogeneous hash codes by preserving the similarities in the data themselves (Ding et al., 2014; Shen et al., 2017). Since the original features typically lack semantic information, these methods generally cannot attain promising retrieval accuracy. By contrast, supervised methods generally learn homogeneous hash codes by preserving the semantic similarities in class labels (Bronstein et al., 2010; Chao et al., 2018; Deng et al., 2018; Liu, Lin et al., 2017; Tang et al., 2016; Wang et al., 2018; Wang, Gao, Wang, He, & Yuan, 2016; Wu et al., 2015; Yao et al., 2017; Zhang & Li, 2014; Zhen & Yeung, 2012). Class labels contain high-level semantic concepts which can enhance the discrimination of hash codes; therefore, supervised methods generally attain higher retrieval accuracy than unsupervised ones.

Several supervised methods that rely on the pair-wise similarity matrix to guide the hash code learning procedure achieve competitive performance (Lin, Ding, Hu & Wang, 2015; Lin, Shen, Shi, Hengel, & Suter, 2014; Liu, Wang, Ji, Jiang, & Chang, 2012; Luo, Zhang et al., 2018; Yao et al., 2017; Yao, Zhang et al., 2019; Zhang & Li, 2014). However, they often suffer from the following limitations. (1) The n × n similarity matrix generally results in high time complexity and memory cost, which makes them unscalable to large-scale datasets (Lin, Ding et al., 2015; Lin et al., 2014; Liu et al., 2012; Yao et al., 2017; Yao, Zhang et al., 2019). (2) The pair-wise similarity matrix may result in semantic information loss and consequently less discriminative hash codes. (3) Most of them first relax the discrete constraints and then generate hash codes by a thresholding operation (Ding et al., 2014; Tang et al., 2016; Wang et al., 2018; Yao et al., 2017; Zhang & Li, 2014). However, the thresholding operation generally causes a large quantization error and consequently less effective hash codes. Besides, most discrete hashing methods learn hash codes bit-by-bit, which generally makes the training phase time-consuming (Liu, Lin et al., 2017; Luo, Zhang et al., 2018; Shen et al., 2015, 2017; Yao et al., 2019).

Motivated by the aforementioned observations, a novel cross-modal hashing method, referred to as Fast Discrete Cross-modal Hashing (FDCH), is proposed in this paper. Specifically, both the pair-wise similarity and class label based consistencies are incorporated to learn the hash codes, so that the semantic consistency is well preserved in them. To avoid the challenging symmetric matrix factorization issue, an asymmetric hash code learning model is designed.
Finally, an effective and fast optimization algorithm is devised to learn discrete hash codes directly, in which each variable has a closed-form solution. The main contributions of this paper are summarized as follows:

(1) We provide a fast discrete cross-modal hashing algorithm for large-scale cross-modal retrieval. The learned hash codes are consistent with both the pair-wise similarities and the class labels, yielding higher discrimination.

(2) To avoid the challenging discrete symmetric matrix factorization issue, we devise an asymmetric hash code learning model
which reduces both the computing complexity and memory cost from O(n²) to O(n). Moreover, each variable has a closed-form solution, and the discrete hash codes can be generated directly. Therefore, both the efficiency and effectiveness of the training procedure are significantly improved.

(3) The proposed method obtains superior results over the baseline methods in both accuracy and efficiency on three real world datasets, i.e., the Wiki, Mirflickr25K and NUS-WIDE datasets.

The remainder of this paper is organized as follows. Section 2 gives a brief overview of hashing methods. Section 3 introduces the proposed FDCH algorithm. The performance of our method is demonstrated in Section 4. Section 5 concludes the paper.

2. Related work

In this section, we briefly review hashing methods, including single-modal hashing and cross-modal hashing.

Single-modal hashing focuses on learning compact hash codes for a single modality. Over the past decade, single-modal hashing has been studied extensively, and a variety of methods have been proposed (Andoni & Indyk, 2006; Cheng et al., 2017; Gui, Liu, Sun, Tao, & Tan, 2018; Yunchao, Svetlana, Albert, & Florent, 2013). Locality-Sensitive Hashing (LSH) utilizes random mapping vectors to embed instances into a Hamming space (Andoni & Indyk, 2006). Iterative Quantization (ITQ) first maps high-dimensional data into a low-dimensional space by PCA, and then finds an orthogonal rotation matrix that minimizes the quantization error of the mapped data (Yunchao et al., 2013). Asymmetric Multi-Valued Hashing (AMVH) learns two distinct non-binary embeddings, i.e., a real-valued embedding and a multi-integer embedding, to preserve more precise similarities between instances (Cheng et al., 2017): the former is employed to represent the query by a nonlinear transformation, and the latter to compress the whole database. Fast Supervised Discrete Hashing (FSDH) utilizes the classification information and pair-wise label information to learn hash codes in a streamlined framework (Gui et al., 2018).

Cross-modal hashing aims at learning a shared Hamming space for heterogeneous data. According to whether class labels are used, cross-modal hashing methods can be loosely classified into unsupervised and supervised ones. Unsupervised cross-modal hashing methods do not take advantage of class labels during the training procedure. Inter-Media Hashing (IMH) preserves inter-modal and intra-modal similarities in the learned Hamming space to enable large-scale cross-modal retrieval (Song, Yang, Yang, Huang, & Shen, 2013). Collective Matrix Factorization Hashing (CMFH) decomposes the heterogeneous data into a shared space by collective matrix factorization, and then generates unified hash codes by a thresholding operation (Ding et al., 2014). Latent Semantic Sparse Hashing (LSSH) decomposes the image and text modalities into two abstract subspaces via sparse coding and matrix factorization, respectively, and then maps them into a shared abstract space to generate the hash codes (Zhou, Ding, & Guo, 2014). Composite Correlation Quantization (CCQ) jointly learns a continuous isomorphic space and a composite quantizer to improve the quality of the hash codes (Long, Cao, Wang, & Yu, 2016). Fusion Similarity Hashing (FSH) preserves the fusion similarity among multi-modal training instances to explicitly capture their heterogeneous correlations (Liu, Ji et al., 2017).
By contrast, supervised cross-modal hashing methods, which exploit the class labels of the training instances to enhance the semantic information in the hash codes, can generally achieve better retrieval performance. Supervised Matrix Factorization Hashing (SMFH)
preserves both the label-based similarity and the local geometric consistency in each modality to learn better hash codes (Tang et al., 2016). Semantics-Preserving Hashing (SePH) transforms the semantic affinity matrix into a probability distribution, and then learns hash codes by approximating this distribution in the Hamming space (Lin, Ding et al., 2015). However, due to the large size of the pair-wise similarity matrix, the computing complexity of these methods is generally too high, making them unscalable to large-scale datasets. To solve the optimization problem efficiently, Semantic Correlation Maximization (SCM) learns hash functions sequentially by preserving the pair-wise similarities among training instances (Zhang & Li, 2014). Label Consistent Matrix Factorization Hashing (LCMFH) jointly incorporates class labels and the collective matrix factorization technique to learn more discriminative unified hash codes for pair-wise heterogeneous instances (Wang et al., 2018). However, these methods do not take the quantization error into consideration, which may result in less effective hash codes. Asymmetric Discrete Cross-Modal Hashing (ADCMH) incorporates collective matrix factorization and the pair-wise similarity matrix to learn hash codes, which are then generated bit-by-bit (Luo, Zhang et al., 2018). Dictionary Learning based Supervised Discrete Hashing (DLSDH) learns sparse representations for all instances from different modalities, and then generates discrete hash codes bit-by-bit (Wu, Luo, Xu, Guo, & Shi, 2018). However, learning the hash codes bit-by-bit is still time-consuming.

Compared with previous studies, FDCH preserves both the class label and pair-wise similarity based semantic consistencies in the hash codes at the same time, which leads to higher retrieval performance. Besides, we design an asymmetric hash code learning model to significantly improve the efficiency of the training procedure, and the discrete hash code matrix has a closed-form solution.

3. Fast discrete cross-modal hashing

In this section, we present the details of the proposed method, i.e., Fast Discrete Cross-modal Hashing (FDCH). Without loss of generality, we restrict the discussion of FDCH to the bimodal case, i.e., the image and text modalities, which are the most common in the real world; it can easily be extended to cases with more than two modalities or with other modalities. The proposed FDCH consists of four components: notations, formulation, the optimization algorithm and the computing complexity analysis, which are described in Sections 3.1–3.4, respectively.

3.1. Notations

Assume that there are n instances O = {o_1, o_2, o_3, ..., o_n}, where o_i = {x_i^(1), x_i^(2)} and x_i^(m) denotes the ith instance in the mth modality. X^(1) = {x_1^(1), x_2^(1), ..., x_n^(1)} ∈ R^{d_1×n} and X^(2) = {x_1^(2), x_2^(2), ..., x_n^(2)} ∈ R^{d_2×n}, where d_1 and d_2 are the feature dimensions of the image and text modalities (usually d_1 ≠ d_2), respectively. Y = {y_1, y_2, ..., y_n} ∈ {0,1}^{c×n} is the class label matrix, where c denotes the total number of categories, and y_ij = 1 if x_j belongs to the ith category and y_ij = 0 otherwise. The learned hash codes are denoted as B = {b_1, b_2, ..., b_n} ∈ {−1,1}^{k×n}, where k denotes the length of the hash codes. To facilitate the cross-modal retrieval task, FDCH learns hash codes from the pair-wise similarity matrix.
In most previous works, the pair-wise similarity matrix is defined as S ∈ {−1,1}^{n×n}, where S_ij = 1 denotes that o_i and o_j are semantically similar to each other, and −1 otherwise (Lin et al., 2014; Liu et al., 2012; Luo, Zhang et al., 2018). In this paper, to better measure the similarities among class labels, FDCH utilizes the cosine similarity to build the pair-wise similarity matrix. We first define the similarity between o_i and o_j as

$$\tilde{S}_{ij} = \frac{\sum_{p=1}^{c} y_{ip} y_{jp}}{\sqrt{\sum_{p=1}^{c} y_{ip}}\ \sqrt{\sum_{p=1}^{c} y_{jp}}}, \tag{1}$$

where y_ip denotes the pth element of the class label vector y_i (since y_ip ∈ {0,1}, Σ_p y_ip equals Σ_p y_ip², so Eq. (1) is exactly the cosine similarity between y_i and y_j). For simplicity, we define ỹ_ip = y_ip / √(Σ_{p=1}^{c} y_ip), and then S̃ = Ỹ^T Ỹ. Therefore, the pair-wise similarity matrix can be defined as

$$S = 2\tilde{S} - 1_{nn} = 2\tilde{Y}^T \tilde{Y} - 1_{nn}, \tag{2}$$

where 1_{nn} ∈ {1}^{n×n}. In Section 3.3, we will use Eq. (2) to reduce the training time and memory cost from O(n²) to O(n).

Since the semantic relations in heterogeneous data are complex, it is hard to preserve them in hash codes with linear hash functions. The kernel method, which can better capture the underlying nonlinear structure of training instances, has been widely applied in the multimedia retrieval domain (Liu et al., 2012; Luo, Zhang et al., 2018; Shen et al., 2015). Accordingly, the RBF kernel is adopted in this paper to project heterogeneous data into a nonlinear space. Specifically, we randomly select q instances as anchors, denoted as a = [a_1, a_2, ..., a_q]. Then the kernelized feature of each instance can be represented by

$$\phi(x_i^{(m)}) = \left[\exp\left(-\frac{\|x_i^{(m)} - a_1\|^2}{2\sigma_m^2}\right), \exp\left(-\frac{\|x_i^{(m)} - a_2\|^2}{2\sigma_m^2}\right), \ldots, \exp\left(-\frac{\|x_i^{(m)} - a_q\|^2}{2\sigma_m^2}\right)\right],$$

where σ_m is a scale parameter for the mth modality, and ‖·‖ denotes the l2-norm. In this paper, q is set to 500.
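For concreteness, the following is a minimal NumPy sketch of the two ingredients defined above: the RBF anchor features φ(·) and the column-normalized label matrix Ỹ whose product Ỹ^TỸ gives S̃ in Eq. (1). The function names (rbf_features, normalize_labels) are our own illustration, not from the authors' implementation.

```python
import numpy as np

def rbf_features(X, anchors, sigma):
    """Map columns of X (d x n) to q-dimensional RBF anchor features (Sec. 3.1).

    anchors: d x q matrix of randomly chosen training instances.
    Returns phi(X) with shape q x n.
    """
    # Squared Euclidean distances between every instance and every anchor.
    d2 = (np.sum(X**2, axis=0)[None, :]            # 1 x n
          - 2.0 * anchors.T @ X                    # q x n
          + np.sum(anchors**2, axis=0)[:, None])   # q x 1
    return np.exp(-d2 / (2.0 * sigma**2))

def normalize_labels(Y):
    """Column-normalize the c x n 0/1 label matrix so that
    Y_tilde.T @ Y_tilde equals the cosine similarity S_tilde of Eq. (1)."""
    norms = np.sqrt(Y.sum(axis=0, keepdims=True))  # sqrt(sum_p y_ip), valid since y in {0,1}
    return Y / np.maximum(norms, 1e-12)
```

Note that S = 2Ỹ^TỸ − 1_{nn} is never formed explicitly; only the c × n factor Ỹ is stored, which is what makes the O(n) reductions of Section 3.3 possible.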
3.2. Formulation

In multi-modal learning, it is imperative to preserve the semantic consistency in the hash codes: instances related to similar semantic topics are expected to have similar hash codes. Many supervised hashing methods preserve the pair-wise similarities in the learned hash codes and achieve promising performance. A common formulation of these methods is to use the inner product of the hash codes to approximate the pair-wise similarities (Lin, Ding et al., 2015; Liu et al., 2012; Yao et al., 2017; Zhang & Li, 2014). Thus, the objective function can be defined as

$$\arg\min_{B}\ \|kS - B^T B\|_F^2 \quad \text{s.t.}\ B \in \{-1,1\}^{k \times n},\ BB^T = nI_k, \tag{3}$$

where ‖·‖_F denotes the Frobenius norm and I_k denotes the identity matrix of size k × k. The constraint BB^T = nI_k enforces the bits to be uncorrelated. However, it is very challenging to solve this symmetric matrix factorization problem, let alone to solve it quickly and discretely (Shi, Sun, Lu, Hong & Razaviyayn, 2017; Xia, Pan, Lai, Liu, & Yan, 2014). Most existing works relax the discrete constraint on the hash codes and then adopt a sequential learning model to obtain continuous solutions (Liu et al., 2012; Luo, Zhang et al., 2018; Xia et al., 2014; Yao et al., 2017). Moreover, the computing complexity and memory cost of these methods are too high, which makes them unscalable to large-scale datasets.

To address the above issues, we propose an asymmetric hash code learning model that transfers the class labels to the hash codes. Specifically, we impose a label consistency constraint on B through a latent semantic space U ∈ R^{k×c}; this constraint ensures that instances with the same class labels have the same hash codes. To fulfill this purpose, we have

$$\arg\min_{B,U}\ \alpha \|B - UY\|_F^2 \quad \text{s.t.}\ B \in \{-1,1\}^{k \times n},\ BB^T = nI_k, \tag{4}$$

where α is a penalty parameter, and Reg(·) = ‖·‖_F² denotes the regularization term used below to reduce the risk of overfitting. Thereafter, we substitute one occurrence of B with the real-valued label embedding UY, and keep the two consistent during the optimization procedure, i.e.,

$$\arg\min_{B,U}\ \|kS - B^T UY\|_F^2 + \alpha \|B - UY\|_F^2 \quad \text{s.t.}\ B \in \{-1,1\}^{k \times n},\ BB^T = nI_k. \tag{5}$$

Obviously, the challenging symmetric matrix factorization issue is thereby avoided. Moreover, for each modality, FDCH learns a group of hash functions that map the heterogeneous data into the shared Hamming space, so as to address the out-of-sample issue. The hash functions for the mth modality are denoted as

$$h^{(m)}(\phi(x_i^{(m)})) = \mathrm{sgn}(W_m \phi(x_i^{(m)})), \tag{6}$$

where sgn(·) denotes the element-wise sign function and W_m is a linear projection matrix. Then we can learn W_m by minimizing

$$\arg\min_{B,W_1,W_2}\ \sum_{m=1}^{2} \beta_m \|B - W_m \phi(X^{(m)})\|_F^2 + \mu \mathrm{Reg}(W_1, W_2) \quad \text{s.t.}\ B \in \{-1,1\}^{k \times n},\ BB^T = nI_k, \tag{7}$$

where β_m and µ are penalty parameters, and Reg(W_1, W_2) = ‖W_1‖_F² + ‖W_2‖_F². To improve the quality of the hash functions, we again substitute the discrete B with the real-valued label embedding, obtaining

$$\arg\min_{U,W_1,W_2}\ \sum_{m=1}^{2} \beta_m \|UY - W_m \phi(X^{(m)})\|_F^2 + \mu \mathrm{Reg}(W_1, W_2). \tag{8}$$

By integrating Eqs. (5) and (8), we obtain

$$\arg\min_{B,U,W_1,W_2}\ \|kS - B^T UY\|_F^2 + \alpha \|B - UY\|_F^2 + \sum_{m=1}^{2} \beta_m \|UY - W_m \phi(X^{(m)})\|_F^2 + \mu \mathrm{Reg}(W_1, W_2) \quad \text{s.t.}\ B \in \{-1,1\}^{k \times n},\ BB^T = nI_k. \tag{9}$$

Due to the hard constraint on B, it is challenging to solve Eq. (9). To address this issue, we soften the hard constraint to get

$$\arg\min_{B,U,W_1,W_2}\ \|kS - B^T UY\|_F^2 + \alpha \|B - UY\|_F^2 + \sum_{m=1}^{2} \beta_m \|UY - W_m \phi(X^{(m)})\|_F^2 + \mu \mathrm{Reg}(W_1, W_2) \quad \text{s.t.}\ B \in \{-1,1\}^{k \times n},\ UYY^T U^T = nI_k. \tag{10}$$

Since both the pair-wise similarity matrix and the class labels are incorporated to learn the hash codes, the semantic consistency can be better preserved in them; therefore, the learned hash codes are more precise in the retrieval task.

3.3. Optimization

There are four variables in Eq. (10), so it is non-convex and hard to solve. Here, we propose a four-step optimization algorithm to handle this problem; the details are elaborated as follows.

W_1-step: Update W_1 by fixing the other variables. By discarding the terms irrelevant to W_1, this sub-problem can be written as

$$\arg\min_{W_1}\ \beta_1 \|UY - W_1 \phi(X^{(1)})\|_F^2 + \mu \|W_1\|_F^2. \tag{11}$$

Obviously, this is a regularized least squares problem which can be solved easily, by setting the derivative of Eq. (11) with respect to W_1 to zero:

$$W_1 \phi(X^{(1)}) \phi(X^{(1)})^T - UY \phi(X^{(1)})^T + \frac{\mu}{\beta_1} W_1 = 0. \tag{12}$$

Then, we can obtain W_1 as

$$W_1 = UY \phi(X^{(1)})^T \Big(\phi(X^{(1)}) \phi(X^{(1)})^T + \frac{\mu}{\beta_1} I_q\Big)^{-1}. \tag{13}$$

Note that Y φ(X^{(1)})^T (φ(X^{(1)})φ(X^{(1)})^T + (µ/β_1)I_q)^{−1} is a constant term which can be pre-calculated only once before the iterative optimization procedure.

W_2-step: Update W_2 by fixing the other variables. In the same way as for W_1, we get

$$W_2 = UY \phi(X^{(2)})^T \Big(\phi(X^{(2)}) \phi(X^{(2)})^T + \frac{\mu}{\beta_2} I_q\Big)^{-1}. \tag{14}$$

Similarly, Y φ(X^{(2)})^T (φ(X^{(2)})φ(X^{(2)})^T + (µ/β_2)I_q)^{−1} can also be pre-calculated before the iterative optimization procedure.
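Eqs. (13) and (14) are standard ridge-regression solutions, and the bracketed factor depends only on the data, so it can be computed once. A minimal NumPy sketch (ours; the names are illustrative):

```python
import numpy as np

def precompute_w_factor(phiX, Y, mu, beta):
    """Pre-compute Y phi(X)^T (phi(X) phi(X)^T + (mu/beta) I_q)^{-1},
    the constant factor shared by Eqs. (13) and (14)."""
    q = phiX.shape[0]
    gram = phiX @ phiX.T + (mu / beta) * np.eye(q)   # q x q, symmetric
    return np.linalg.solve(gram, phiX @ Y.T).T       # Y phi^T gram^{-1}, c x q

def update_W(U, w_factor):
    """W-step: W_m = U times the pre-computed factor, per Eq. (13)/(14)."""
    return U @ w_factor                              # k x q
```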
B-step: Update B by fixing the other variables. By discarding the terms irrelevant to B, this sub-problem can be written as

$$\arg\min_{B}\ \mathrm{tr}(B^T UYY^T U^T B) - 2\,\mathrm{tr}(kB^T UYS^T) + \alpha\,\mathrm{tr}(B^T B) - 2\alpha\,\mathrm{tr}(B^T UY) \quad \text{s.t.}\ B \in \{-1,1\}^{k \times n},\ UYY^T U^T = nI_k. \tag{15}$$

Here, tr(B^T UYY^T U^T B) and tr(B^T B) are constants, so we obtain the optimal solution of B as

$$B = \mathrm{sgn}(kUYS^T + \alpha UY). \tag{16}$$

However, the computing complexity and memory cost of YS^T are O(n²), which is intolerable in large-scale applications. To address this issue, we replace S by Eq. (2); then this term can be transformed into

$$YS^T = Y(2\tilde{Y}^T \tilde{Y} - 1_{nn}) = 2Y\tilde{Y}^T \tilde{Y} - Y 1_{nn}. \tag{17}$$

For the first term 2YỸ^TỸ, we first compute YỸ^T, whose size is c × c, so the computing complexity and memory cost of this term reduce to O(n); moreover, it is a constant which can be pre-calculated only once before the iterative optimization procedure. For the second term, since the size of 1_{nn} is n × n, its computing complexity and memory cost are also O(n²). However, this term can be replaced by a constant matrix D ∈ R^{c×n} with D_ij = Σ_{p=1}^{n} Y_ip, so its computing complexity and memory cost also reduce to O(n); it too can be pre-calculated only once before the iterative optimization procedure.
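The reduction in Eq. (17) is easy to verify numerically. A sketch (ours) that forms YS^T in O(n) without ever materializing the n × n matrix S:

```python
import numpy as np

def YS_transpose(Y):
    """Compute Y S^T per Eq. (17), where S = 2 Yt^T Yt - 1_nn and Yt is the
    column-normalized label matrix. Only c x c and c x n intermediates are
    formed, never the n x n matrix S itself."""
    n = Y.shape[1]
    Yt = Y / np.maximum(np.sqrt(Y.sum(axis=0, keepdims=True)), 1e-12)
    first = 2.0 * (Y @ Yt.T) @ Yt                        # 2 Y Yt^T Yt, via a c x c product
    D = Y.sum(axis=1, keepdims=True) * np.ones((1, n))   # Y 1_nn: row sums replicated
    return first - D

# Sanity check against the explicit O(n^2) construction on a tiny example.
Y = np.array([[1, 0, 1], [0, 1, 1]], dtype=float)
Yt = Y / np.sqrt(Y.sum(axis=0, keepdims=True))
S = 2.0 * Yt.T @ Yt - np.ones((3, 3))
assert np.allclose(YS_transpose(Y), Y @ S.T)
```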
U-step: Update U by fixing the other variables. By discarding the terms irrelevant to U, this sub-problem can be written as

$$\arg\min_{U}\ \|kS - B^T UY\|_F^2 + \alpha \|B - UY\|_F^2 + \sum_{m=1}^{2} \beta_m \|UY - W_m \phi(X^{(m)})\|_F^2 + \mu\,\mathrm{Reg}(U) \quad \text{s.t.}\ UYY^T U^T = nI_k. \tag{18}$$

It has been proved that the mapping vectors do not necessarily need to be orthogonal to each other, and that in practice the performance without imposing the orthogonal constraint might outperform that with the orthogonal constraint (Wang, Kumar, & Chang, 2012; Zhang & Li, 2014). Motivated by this, we discard the orthogonal constraint in this step. Setting the derivative of Eq. (18) with respect to U to zero, we get

$$-kBSY^T + BB^T UYY^T - \alpha BY^T + \alpha UYY^T + \sum_{m=1}^{2} \beta_m (UYY^T - W_m \phi(X^{(m)}) Y^T) = 0. \tag{19}$$

Then we can obtain U as

$$U = (BB^T + \alpha I + \beta_1 I + \beta_2 I)^{-1}\,(kBSY^T + \alpha BY^T + \beta_1 W_1 \phi(X^{(1)}) Y^T + \beta_2 W_2 \phi(X^{(2)}) Y^T)\,(YY^T)^{-1}. \tag{20}$$

As in the B-step, the computing complexity and memory cost of SY^T can be reduced from O(n²) to O(n), and it can be pre-calculated only once before the iterative optimization procedure for efficiency. Moreover, the term (YY^T)^{−1} is a constant and can likewise be pre-calculated only once.

Clearly, since each sub-problem is convex in its own variable, the objective function value after each step is lower than (or equal to) that of the previous step, so repeating the above four steps drives the objective function to a (local) minimum. Note that all variables have closed-form solutions, which is more efficient than most existing discrete hashing methods. We summarize the whole optimization procedure of FDCH in Algorithm 1.

Algorithm 1: Fast Discrete Cross-modal Hashing
Input: The hash code length k, the features of the multi-modal instances {X^(1), X^(2)} and their labels Y.
1: Randomly select q anchors to map X^(1), X^(2) to the nonlinear spaces φ(X^(1)), φ(X^(2)), respectively.
2: Construct the similarity matrix S from the class labels, and initialize W_1, W_2, U and B randomly.
3: for i = 1 to T do
4:   Fix W_2, U and B, update W_1 by Eq. (13);
5:   Fix W_1, U and B, update W_2 by Eq. (14);
6:   Fix W_1, W_2 and U, update B by Eq. (16);
7:   Fix W_1, W_2 and B, update U by Eq. (20);
8: end for
Output: Projection matrices W_1, W_2 and hash codes B.
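To make Algorithm 1 concrete, the following is a compact NumPy sketch of the training iteration (our own illustration under the notation above, not the authors' released code; the default hyper-parameters, the small ridge added to YY^T, and all function names are assumptions). It exploits the O(n) identity of Eq. (17), so the n × n similarity matrix S is never materialized:

```python
import numpy as np

def train_fdch(phiX1, phiX2, Y, k, T=10,
               alpha=1e5, beta1=10.0, beta2=10.0, mu=0.5, seed=0):
    """Sketch of Algorithm 1. phiX1, phiX2: q x n kernelized features;
    Y: c x n binary label matrix; k: hash code length."""
    rng = np.random.default_rng(seed)
    q, n = phiX1.shape
    c = Y.shape[0]

    # Normalized labels (Eq. (1)) and the O(n) form of Y S^T (Eq. (17)).
    Yt = Y / np.maximum(np.sqrt(Y.sum(axis=0, keepdims=True)), 1e-12)
    YSt = 2.0 * (Y @ Yt.T) @ Yt - Y.sum(axis=1, keepdims=True) * np.ones((1, n))

    # Constant factors of Eqs. (13), (14) and (20), pre-computed once.
    F1 = np.linalg.solve(phiX1 @ phiX1.T + (mu / beta1) * np.eye(q), phiX1 @ Y.T).T
    F2 = np.linalg.solve(phiX2 @ phiX2.T + (mu / beta2) * np.eye(q), phiX2 @ Y.T).T
    YYt_inv = np.linalg.inv(Y @ Y.T + 1e-6 * np.eye(c))  # small ridge for stability

    U = rng.standard_normal((k, c))
    B = np.sign(rng.standard_normal((k, n)))
    for _ in range(T):
        W1 = U @ F1                                    # W1-step, Eq. (13)
        W2 = U @ F2                                    # W2-step, Eq. (14)
        B = np.sign(k * (U @ YSt) + alpha * (U @ Y))   # B-step, Eq. (16)
        U = np.linalg.solve(                           # U-step, Eq. (20)
            B @ B.T + (alpha + beta1 + beta2) * np.eye(k),
            k * (B @ YSt.T) + alpha * (B @ Y.T)
            + beta1 * (W1 @ phiX1 @ Y.T) + beta2 * (W2 @ phiX2 @ Y.T),
        ) @ YYt_inv
    # Refresh the hash functions with the final U.
    W1, W2 = U @ F1, U @ F2
    return W1, W2, B
```

A query x from modality m is then encoded as sgn(W_m φ(x)), following Eq. (6).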
Moreover, after learning the projection matrices of both modalities, the hash codes of queries can be obtained by the corresponding hash functions in Eq. (6).

3.4. Computing complexity analysis

In this section, we analyze the computing complexity of the proposed FDCH and show that it can be applied to large-scale datasets. The computing complexity of FDCH is dominated by the iterative discrete optimization. In each iteration, the computing complexities of updating W_1, W_2, U and B are O(kcn + kqn + q²n + q³), O(kcn + kqn + q²n + q³), O(k²n + k³ + 2kcn + 2kqn + c²n + c³) and O(2kcn), respectively. Moreover, in our experiments we find that the optimization algorithm converges rapidly, typically within 10 iterations. Since the code length k, the number of classes c and the dimension q of the kernelized features are far smaller than the training set size n, the computing complexity of FDCH is linear in n, which makes it scalable to large-scale applications.

4. Experiments

In this section, we carry out comparison experiments on three real world datasets to evaluate the proposed FDCH against several cross-modal hashing methods in terms of both retrieval accuracy and efficiency. All experiments are carried out on a workstation with two Intel Xeon E5-2630 v4 CPUs (2.2 GHz, 10 cores, 25 MB cache) and 128 GB of memory.

4.1. Datasets

In our experiments, three public datasets are utilized to evaluate the effectiveness and efficiency of the proposed FDCH.

Wiki (Rasiwasia et al., 2010): It contains a total of 2866 instances crawled from Wikipedia. Each instance comprises an image and its corresponding short text, and is manually annotated with one of 10 class labels. The images are represented by 4096-dimensional CNN feature vectors (Krizhevsky, Sutskever, & Hinton, 2012), and the texts are represented by 10-dimensional topic feature vectors. 75% of the instances are randomly selected to form the training set, and the rest form the query set.

Mirflickr25K (Huiskes & Lew, 2008): It has 25,000 instances crawled from Flickr. Each instance includes one image and its associated tags, and is annotated with one or more of 24 class labels. In our experiments, we only retain tags appearing at least 20 times and discard instances without textual tags or class labels; as a result, 20,015 instances are kept. The images are represented by 4096-dimensional CNN feature vectors (Krizhevsky et al., 2012), and the texts are represented by 500-dimensional feature vectors derived from PCA on their bag-of-words vectors. As for the Wiki dataset, 75% of the instances are randomly selected to form the training set, and the rest form the query set.

NUS-WIDE (Chua et al., 2009): It is a multi-modal dataset with 269,648 instances. Each instance contains an image and its associated tags, and belongs to at least one of 81 semantic categories. Following Ding et al. (2014), Ma, Liang, Kong, and He (2016), Zhang and Li (2014) and Zhou et al. (2014), only the instances belonging to the 10 most frequent class labels (186,776 instances) are kept. The images are encoded as 1000-dimensional bag-of-visual-words feature vectors, and the texts are encoded as bag-of-keywords feature vectors. As in Ding et al. (2014), Ma et al. (2016), Yao, Wang et al. (2019) and Zhang and Li (2014), the dataset is divided into a training set with 99% of the instances and a query set with the remaining 1%. For SMFH, since the training time on all available training instances of NUS-WIDE is too high, we randomly select 10,000 instances to train its hash functions, as in Wang et al. (2018) and Zhou et al. (2014); for the other methods, all instances in the training set are employed. Moreover, we consider two instances to be semantically similar if they share at least one common class label, and semantically dissimilar otherwise.

4.2. Baseline methods and implementation details

To validate the effectiveness of our FDCH, we compare it against several state-of-the-art cross-modal hashing methods, including three unsupervised ones, i.e., CMFH (Ding et al., 2014), LSSH (Zhou et al., 2014) and FSH (Liu, Ji et al., 2017), and five supervised ones, i.e., PDH (Rastegari et al., 2013), SCM-Seq (Zhang & Li, 2014), SMFH (Tang et al., 2016), DASH (Ma et al., 2016) and SRLCH (Liu et al., 2018). Since DASH (Ma et al., 2016) can use either the image or the text modality to generate hash codes first, we denote the two variants as DASHi and DASHt, respectively. The source codes of all baseline methods are kindly provided by their authors, and we report their retrieval performances with the default parameter settings. All experiments are run 5 times to reduce the influence of random initialization, and the results are averaged.
Table 1
MAP@100 scores comparison on Wiki, Mirflickr25K and NUS-WIDE datasets.

Task: Image query Text

Methods                          Wiki                               Mirflickr25K                       NUS-WIDE
                                 16      24      32      64         16      24      32      64         16      24      32      64
PDH (Rastegari et al., 2013)     0.2057  0.2065  0.2170  0.2246     0.6247  0.6346  0.6421  0.6570     0.4373  0.4276  0.4318  0.4570
SCM-Seq (Zhang & Li, 2014)       0.2674  0.2710  0.2805  0.2950     0.8684  0.8710  0.8762  0.8840     0.6170  0.6346  0.6501  0.6866
CMFH (Ding et al., 2014)         0.2187  0.2190  0.2287  0.2403     0.6361  0.6396  0.6402  0.6497     0.5114  0.5270  0.5289  0.5403
LSSH (Zhou et al., 2014)         0.2211  0.2269  0.2303  0.2581     0.6319  0.6432  0.6519  0.6690     0.4966  0.5046  0.5255  0.5284
SMFH (Tang et al., 2016)         0.2438  0.2611  0.2484  0.2669     0.6416  0.6489  0.6522  0.6599     0.3966  0.3952  0.3977  0.4052
DASHi (Ma et al., 2016)          0.2951  0.3051  0.3069  0.3100     0.8516  0.8528  0.8654  0.8770     0.6453  0.6491  0.6597  0.7892
DASHt (Ma et al., 2016)          0.2869  0.2821  0.3099  0.3175     0.8283  0.8378  0.8367  0.8681     0.5653  0.5544  0.5572  0.5795
FSH (Liu, Ji et al., 2017)       0.2440  0.2613  0.2618  0.2671     0.5539  0.5580  0.5614  0.5980     0.5279  0.5096  0.5195  0.5419
SRLCH (Liu et al., 2018)         0.4400  0.4510  0.4641  0.4672     0.6953  0.7603  0.8110  0.8513     0.7922  0.8154  0.8429  0.8561
FDCH                             0.4527  0.4796  0.4849  0.4857     0.9641  0.9740  0.9761  0.9769     0.8124  0.8675  0.8770  0.8834

Task: Text query Image

Methods                          Wiki                               Mirflickr25K                       NUS-WIDE
                                 16      24      32      64         16      24      32      64         16      24      32      64
PDH (Rastegari et al., 2013)     0.2326  0.2538  0.2425  0.2617     0.6357  0.6395  0.6493  0.6581     0.5232  0.5361  0.5403  0.5510
SCM-Seq (Zhang & Li, 2014)       0.6305  0.6347  0.6357  0.6354     0.8419  0.8749  0.8752  0.8803     0.5865  0.6138  0.6141  0.6309
CMFH (Ding et al., 2014)         0.5257  0.5181  0.5352  0.5440     0.6309  0.6381  0.6474  0.6400     0.5257  0.5370  0.5455  0.5518
LSSH (Zhou et al., 2014)         0.5981  0.6115  0.6207  0.6169     0.7093  0.7388  0.7460  0.7593     0.6081  0.6328  0.6482  0.6507
SMFH (Tang et al., 2016)         0.6363  0.6514  0.6591  0.6603     0.6419  0.6478  0.6681  0.6752     0.4130  0.3929  0.4151  0.4168
DASHi (Ma et al., 2016)          0.6218  0.6502  0.6613  0.6623     0.8281  0.8636  0.8669  0.8703     0.6514  0.6615  0.6630  0.6634
DASHt (Ma et al., 2016)          0.5974  0.6176  0.6222  0.6239     0.8279  0.8605  0.8644  0.8726     0.5833  0.6066  0.6132  0.6207
FSH (Liu, Ji et al., 2017)       0.5518  0.5736  0.5907  0.5880     0.5349  0.5578  0.5603  0.5687     0.5592  0.5631  0.5704  0.5816
SRLCH (Liu et al., 2018)         0.6392  0.6630  0.6727  0.6706     0.7931  0.8381  0.8419  0.8782     0.8498  0.8544  0.8659  0.8715
FDCH                             0.6845  0.6908  0.6923  0.6946     0.9315  0.9441  0.9534  0.9581     0.9140  0.9226  0.9251  0.9278
4.3. Evaluation criteria

For a fair evaluation, we adopt three popular metrics from the multimedia retrieval domain to evaluate the effectiveness of our FDCH, i.e., Mean Average Precision (MAP), the Precision-Recall (PR) curve and the Top-K precision curve. Given a query and its ranked list of retrieved results, the average precision (AP) is defined as

$$AP = \frac{1}{L} \sum_{r=1}^{R} P(r)\,\delta(r), \tag{21}$$

where L is the number of instances semantically related to the query among the top R retrieved instances, and P(r) is the precision over the top r retrieved instances. δ(r) = 1 if the rth retrieved instance is semantically related to the query, and δ(r) = 0 otherwise. The MAP is obtained by averaging the AP values over all queries, and MAP@R denotes the MAP score calculated over the top R retrieved instances. Besides, the Top-K precision reflects the variation of the precision with respect to the number K of top-ranked retrieved instances, and the Precision-Recall (PR) curve reflects the precision at different recall ratios. All three criteria are expressive for multimedia retrieval tasks, and larger values denote better retrieval performance.
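A direct transcription of Eq. (21) into NumPy (our own sketch; `relevant` marks the positions with δ(r) = 1 in the ranked list):

```python
import numpy as np

def average_precision(relevant, R=100):
    """AP over the top-R ranked results, per Eq. (21).
    relevant: boolean sequence; relevant[r] is True iff the (r+1)-th
    retrieved instance shares a class label with the query."""
    rel = np.asarray(relevant[:R], dtype=float)
    L = rel.sum()
    if L == 0:
        return 0.0
    # P(r): precision within the top r results, for r = 1..R.
    precision_at_r = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_r * rel).sum() / L)

# MAP@R is the mean of average_precision over all queries.
```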
4.4. Experimental result analysis

The MAP@100 results of all compared methods on the three real world datasets, with the encoding length set to 16, 24, 32 and 64 bits, are summarized in Table 1. From this table, we have the following observations:

(1) For the two retrieval tasks, the proposed FDCH obtains the best performance on the three datasets at all code lengths, and performs much better than the baseline methods in some cases, which clearly shows its effectiveness. Specifically, compared with the best baseline method, FDCH obtains up to 7%, 12% and 6% performance gains for the image-query-text task on Wiki, Mirflickr25K and NUS-WIDE, respectively, and up to 4%, 11% and 8% gains for the text-query-image task on the same datasets. The significant improvements verify that embedding both the pair-wise similarity and class label based semantic consistencies into the hash code learning, together with the discrete optimization, generates more discriminative hash codes.

(2) The MAP@100 scores of our FDCH increase with the code length. This is reasonable: a longer code can encode more semantic information into the hash codes.

(3) The supervised hashing methods, e.g., SCM-Seq, DASHi, DASHt and SRLCH, generally perform better than the unsupervised ones. The reason is that the hash codes learned by supervised methods contain semantic information, which makes them more discriminative.

The Top-K precision curves for 16-bit codes on the three real world datasets are plotted in Fig. 1. From this figure, we can draw the following observations. First, FDCH outperforms the baseline methods, and by a large margin in some cases, which demonstrates its superiority. Second, supervised methods, such as SCM-Seq, DASHi, DASHt and SRLCH, generally outperform the unsupervised ones, which is consistent with the MAP@100 results in Table 1.

Fig. 2 plots the Precision-Recall curves on the three real world datasets at 16 bits code length. We make similar observations as for the MAP@100 scores and the Top-K precision curves: FDCH always outperforms the baseline methods, and by a large margin in some cases. The significant improvement of FDCH can principally be attributed to its capability of better preserving the semantic consistency in the learned hash codes and to the proposed discrete optimization algorithm.

To visually show the retrieval results, we provide two cross-modal retrieval cases on the Wiki dataset for the proposed FDCH and the best compared method, SRLCH. Owing to space limitations, we only provide some key words or sentences for the text instances. For the text query, we show the top 10 nearest images in Fig. 3; the returned images of FDCH are more semantically relevant to the query text than those of SRLCH. Similarly, for the image query, the top 10 nearest texts are shown in Fig. 4; all texts retrieved by FDCH are semantically relevant to the query image, while for SRLCH the ninth text is semantically irrelevant. The above results demonstrate that the semantic consistency is better preserved in the hash codes learned by FDCH: both the pair-wise similarity and class label based semantic consistencies are preserved in the hash codes of FDCH, while only the class label based semantic consistency is preserved for SRLCH.
Fig. 1. Top-K precision curves of FDCH and the baseline methods with 16 bits code length on Wiki, Mirflickr25K and NUS-WIDE datasets.
Fig. 2. Precision-Recall curves of FDCH and the baseline methods with 16 bits code length on Wiki, Mirflickr25K and NUS-WIDE datasets.
Fig. 3. An example of the text-query-image retrieval with 16 bits hash codes on Wiki dataset. Images with red boxes are wrong, while those with green boxes are correct. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 4. An example of the image-query-text retrieval with 16 bits hash codes on Wiki dataset. Texts with red boxes are wrong, while those with green boxes are correct. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 5. The experimental comparison with SCSH at 16 bits on the three datasets. The red line denotes the results of our FDCH, and the black one the results of SCSH; the square denotes the Text-query-Image task, and the triangle the Image-query-Text task. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 2
MAP@100 scores comparison between FDCH and FDCH-Orth on the three datasets.

Datasets       Tasks              Methods      16 bits  24 bits  32 bits  64 bits
Wiki           Image-query-Text   FDCH-Orth    0.6832   0.6788   0.6863   0.6796
                                  FDCH         0.6845   0.6908   0.6923   0.6946
               Text-query-Image   FDCH-Orth    0.4466   0.4571   0.4670   0.4641
                                  FDCH         0.4527   0.4796   0.4849   0.4857
Mirflickr25K   Image-query-Text   FDCH-Orth    0.9067   0.9163   0.9134   0.9335
                                  FDCH         0.9315   0.9441   0.9534   0.9581
               Text-query-Image   FDCH-Orth    0.9491   0.9589   0.9554   0.9709
                                  FDCH         0.9641   0.9740   0.9761   0.9769
NUS-WIDE       Image-query-Text   FDCH-Orth    0.8679   0.8857   0.8936   0.8696
                                  FDCH         0.9140   0.9226   0.9251   0.9278
               Text-query-Image   FDCH-Orth    0.7807   0.8021   0.7976   0.8017
                                  FDCH         0.8124   0.8675   0.8770   0.8834
4.5. Effects of discarding the orthogonal constraint

To verify the effectiveness of discarding the orthogonal constraint in the U-step, we also design an optimization algorithm with the orthogonal constraint retained, termed FDCH-Orth. As in Shi, Xing et al. (2017), we can obtain U = CY^T(YY^T + µI_c)^{−1}, where C = √n QA^T, A = JM^T Q Λ^{−1} ∈ R^{n×k}, J = I_n − (1/n)11^T is a centering matrix, and Q ∈ R^{k×k} and Λ^{−1} are obtained from the SVD of MJM^T = QΛ²Q^T. The comparison results of FDCH and FDCH-Orth are shown in Table 2. Although there is an inconsistency in the learning procedure of FDCH, it significantly outperforms FDCH-Orth, which demonstrates the effectiveness of discarding the orthogonal constraint in the U-step.
4.6. Effects of the asymmetric strategy

To verify the effectiveness of the proposed asymmetric strategy, we conduct comparison experiments with a symmetric strategy. For the symmetric strategy, the symmetric matrix factorization is a very challenging problem (Shi, Sun et al., 2017), let alone solving it discretely. In this paper, the scalable coordinate descent (SCD) algorithm is applied to solve for the hash code matrix B (Xia et al., 2014), and the other variables are solved by the same algorithms as in the proposed FDCH. For ease of presentation, we name this method Semantic Consistency Symmetric Hashing, SCSH for short. Due to the high computing complexity and memory cost, we randomly select 10,000 instances from the NUS-WIDE dataset to train the SCSH model. The performance comparisons on the three datasets are shown in Fig. 5. Intuitively, higher performance should be obtained by the symmetric strategy. However, from Fig. 5 it can be observed that FDCH outperforms SCSH in all cases and significantly outperforms SCSH on the NUS-WIDE dataset. The reason is twofold: (1) SCSH relaxes the discrete constraints to generate hash codes, which generally causes a large quantization error and consequently less effective hash codes, while FDCH learns discrete hash codes directly. (2) Due to the high computing complexity and memory cost, only part of the training instances can be utilized to train SCSH on the NUS-WIDE dataset, which results in a loss of training information, while all available training instances can be utilized to train FDCH.

4.7. The relation between k and c

We also conduct experiments with different code lengths on the three datasets to investigate the relation between the code length k and the number of categories c. Specifically, for the Mirflickr25K dataset, we choose the instances belonging to 8 and 16 randomly selected categories, respectively, to form two subsets. Similarly, two subsets of the Wiki and NUS-WIDE datasets are generated from the instances belonging to 4 and 7 randomly selected categories, respectively. The retrieval performances on the generated subsets and the three full datasets are shown in Table 3. From this table, we have the following observations: (1) With k fixed, the smaller c is, the better the performance generally is; the larger c is, the more complex the semantic correlations among instances generally become. (2) With c fixed, the larger k is, the better the performance generally is, since longer hash codes can incorporate more semantic information from the training instances.

4.8. Convergence study
As illustrated in Algorithm 1, each updating step decreases the objective function value, so the objective function converges to a local minimum. Fig. 6 shows the convergence curves on the Wiki and Mirflickr25K datasets. As can be seen, the scheme converges in fewer than 5 iterations on the Wiki dataset and in about 6 iterations on the Mirflickr25K dataset. This clearly demonstrates that the proposed optimization algorithm has very promising convergence efficiency.

4.9. Training time

To further verify the efficiency of the proposed FDCH, the training time comparisons on the three datasets are reported in Table 4. We keep at least two significant figures, and we do not report the training time of SMFH on the NUS-WIDE dataset due to its large training cost. From Table 4, it can
Table 3
The retrieval performance comparisons on the generated subsets and the three datasets.

Wiki
Categories (c)      4                                7                                10
Code length (k)     16      24      32      64       16      24      32      64       16      24      32      64
Image-query-Text    0.5760  0.5800  0.5862  0.6076   0.4941  0.5267  0.5321  0.5414   0.4527  0.4796  0.4849  0.4857
Text-query-Image    0.7713  0.7945  0.8005  0.8028   0.7651  0.7851  0.7898  0.7943   0.6845  0.6908  0.6923  0.6946

Mirflickr25K
Categories (c)      8                                16                               24
Code length (k)     16      24      32      64       16      24      32      64       16      24      32      64
Image-query-Text    0.9679  0.9764  0.9792  0.9910   0.9662  0.9745  0.9783  0.9798   0.9641  0.9740  0.9761  0.9769
Text-query-Image    0.9455  0.9541  0.9625  0.9809   0.9344  0.9491  0.9610  0.9682   0.9315  0.9441  0.9534  0.9581

NUS-WIDE
Categories (c)      4                                7                                10
Code length (k)     16      24      32      64       16      24      32      64       16      24      32      64
Image-query-Text    0.8736  0.9331  0.9379  0.9546   0.8338  0.8718  0.8826  0.8887   0.8124  0.8675  0.8770  0.8834
Text-query-Image    0.9290  0.9683  0.9733  0.9810   0.9173  0.9294  0.9347  0.9389   0.9140  0.9226  0.9251  0.9278
Table 4
Training time (in seconds) comparison with 16 bits code length on the three datasets.

Methods                          Wiki     Mirflickr25K  NUS-WIDE
PDH (Rastegari et al., 2013)     39       335           1044
SCM-Seq (Zhang & Li, 2014)       18036    20374         46
CMFH (Ding et al., 2014)         16       54            89
LSSH (Zhou et al., 2014)         14       71            594
DASHi (Ma et al., 2016)          17       25            9.5
DASHt (Ma et al., 2016)          10       17            9.8
FSH (Liu, Ji et al., 2017)       1164     2752          1046
SMFH (Tang et al., 2016)         228      842           –
SRLCH (Liu et al., 2018)         1.2      6.1           34
FDCH                             0.91     5.5           23
be seen that FDCH has the lowest training time on the three datasets in most cases compared with the baseline methods. Hence, FDCH possesses not only a competitive training time but also better retrieval performance than the baselines. Moreover, the training time of SCM-Seq is sensitive to the dimension of the input features, which is why it consumes much more time on the Wiki and Mirflickr25K datasets than on the NUS-WIDE dataset.

4.10. Parameter sensitivity

There are four key parameters in FDCH, i.e., α, β1, β2 and µ, which are chosen by a cross validation procedure. Fig. 7 shows the sensitivity analysis of FDCH with respect to each parameter. From this figure, we can see that FDCH generally obtains good performance when α, β1, β2 and µ lie in the ranges [50,000, 200,000], [10, 5000], [10, 1000] and [0.1, 1],
respectively. Thus we set α = 100,000, β1 = 10, β2 = 10 and µ = 0.5 as the defaults in our experiments.

4.11. Ablation study

Our model includes four terms, i.e., pair-wise similarity consistency (PSC), label consistency (LC), hash function learning (HFL) and regularization (Reg). Note that the hash codes of queries are generated by the hash functions, so the HFL term is an indispensable part of our objective function. We therefore perform an ablation study on the other three terms; Table 5 shows the retrieval performance on the three datasets. The highest retrieval performance is achieved when all terms are present, while the performance drops when any term is absent. Moreover, the retrieval performance with label consistency is higher than that without it, which supports our claim about the advantage of using the label consistency.

5. Conclusion

In this paper, we propose a novel method for cross-modal retrieval, which projects heterogeneous data into a shared Hamming space in which the similarities among heterogeneous data can be calculated rapidly. To learn more discriminative hash codes, both the pair-wise similarity and class label based semantic consistencies are preserved in the hash codes. To avoid the high memory cost and computing complexity of the training procedure, we design an asymmetric hash code learning model. Subsequently, we introduce a fast discrete optimization algorithm that generates discrete hash codes directly and thereby avoids the quantization error. Our framework significantly exceeds the compared methods, which demonstrates that the semantic correlations can be better preserved in the learned hash codes. However, this work only learns hash codes in a shallow manner; in future work, we will implement our cross-modal hashing framework in an end-to-end deep model to boost its performance.

Acknowledgment

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61872170, 61772253, 61903172, 61771231, 61873117).
Fig. 6. Objective function values versus iterations of our FDCH on the (a) Wiki and (b) Mirflickr25K datasets.
Fig. 7. Parameter studies with 16 bits code length on Wiki, Mirflickr25K and NUS-WIDE datasets.

Table 5
Ablation study on the three datasets (IqT = Image-query-Text, TqI = Text-query-Image).

PSC  LC  Reg    Wiki (IqT / TqI)    Mirflickr25K (IqT / TqI)    NUS-WIDE (IqT / TqI)
 √    √         0.4384 / 0.6802     0.9549 / 0.9243             0.8035 / 0.9108
      √    √    0.2172 / 0.3021     0.9431 / 0.9126             0.8047 / 0.8997
 √         √    0.1463 / 0.1879     0.6442 / 0.5930             0.3282 / 0.3615
 √    √    √    0.4527 / 0.6845     0.9641 / 0.9315             0.8124 / 0.9140
References

Andoni, A., & Indyk, P. (2006). Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of computer science (pp. 459–468).
Bronstein, M. M., Bronstein, A. M., Michel, F., & Paragios, N. (2010). Data fusion through cross-modality metric learning using similarity-sensitive hashing. In IEEE conference on computer vision and pattern recognition (pp. 3594–3601).
Chao, L., Cheng, D., Ning, L., Wei, L., Gao, X., & Tao, D. (2018). Self-supervised adversarial hashing networks for cross-modal retrieval. In IEEE conference on computer vision and pattern recognition (pp. 4242–4251).
Cheng, D., Xu, S., Ding, K., Meng, G., Xiang, S., & Pan, C. (2017). AMVH: Asymmetric multi-valued hashing. In IEEE conference on computer vision and pattern recognition (pp. 736–744).
Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., & Zheng, Y. (2009). NUS-WIDE: A real-world web image database from National University of Singapore. In ACM conference on image and video retrieval (pp. 48–56).
Deng, C., Chen, Z., Liu, X., Gao, X., & Tao, D. (2018). Triplet-based deep hashing network for cross-modal retrieval. IEEE Transactions on Image Processing, 27(8), 3893–3903.
Deng, C., Yang, E., Liu, T., Li, J., Liu, W., & Tao, D. (2019). Unsupervised semantic-preserving adversarial hashing for image search. IEEE Transactions on Image Processing, 28(8), 4032–4044.
Deng, C., Yang, E., Liu, T., & Tao, D. (2019). Two-stream deep hashing with class-specific centers for supervised image search. IEEE Transactions on Neural Networks and Learning Systems.
Ding, G., Guo, Y., & Zhou, J. (2014). Collective matrix factorization hashing for multimodal data. In IEEE conference on computer vision and pattern recognition (pp. 2083–2090). Gui, J., Liu, T., Sun, Z., Tao, D., & Tan, T. (2018). Fast supervised discrete hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2), 490–496. Hu, M., Yang, Y., Shen, F., Xie, N., Hong, R., & Shen, H. T. (2019). Collective reconstructive embeddings for cross-modal hashing. IEEE Transactions on Image Processing, http://dx.doi.org/10.1109/TCYB.2018.2831447. Hu, M., Yang, Y., Shen, F., Xie, N., & Shen, H. T. (2018). Hashing with angular reconstructive embeddings. IEEE Transactions on Image Processing, 27(2), 545–555. Huiskes, M. J., & Lew, M. S. (2008). The MIR flickr retrieval evaluation. In ACM sigmm international conference on multimedia information retrieval (pp. 39–43). Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In The annual conference on neural information processing systems (pp. 1097–1105). Li, B., Fan, Z., Zhang, X., & Huang, D. (2019). Robust dimensionality reduction via feature space to feature space distance metric learning. Neural Networks, 112(10), 1–14. Lin, Z., Ding, G., Hu, M., & Wang, J. (2015). Semantics-preserving hashing for cross-view retrieval. In IEEE conference on computer vision and pattern recognition (pp. 3864–3872). Lin, G., Shen, C., & Av, H. (2015). Supervised hashing using graph cuts and boosted decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11), 2317–2331. Lin, G., Shen, C., Shi, Q., Hengel, A. V. D., & Suter, D. (2014). Fast supervised hashing with decision trees for high-dimensional data. In IEEE conference on computer vision and pattern recognition (pp. 1963–1970). Liu, H., Ji, R., Wu, Y., Huang, F., & Zhang, B. (2017). Cross-modality binary code learning via fusion similarity hashing. In IEEE conference on computer vision and pattern recognition (pp. 6345–6353). Liu, L., Lin, Z., Shao, L., Shen, F., Ding, G., & Han, J. (2017). Sequential discrete hashing for scalable cross-modality similarity retrieval. IEEE Transactions on Image Processing, 26(1), 107–118. Liu, X., Nie, X., Zhou, Q., Xi, X., Zhu, L., & Yin, Y. (2019). Supervised shortlength hashing. In International joint conferences on artificial intelligence (pp. 3031–3037). Liu, W., Wang, J., Ji, R., Jiang, Y. G., & Chang, S. F. (2012). Supervised hashing with kernels. In Computer vision and pattern recognition (pp. 2074–2081).
Liu, L., Yang, Y., Hu, M., Xing, X., Shen, F., Ning, X., & Zi, H. (2018). Index and retrieve multimedia data: cross-modal hashing by learning subspace relation. In International conference on database systems for advanced applications (pp. 606–621). Long, M., Cao, Y., Wang, J., & Yu, P. S. (2016). Composite correlation quantization for efficient multimodal retrieval. In ACM SIGIR conference on research and development in information retrieval (pp. 579–588). Lu, X., Zhu, L., Cheng, Z., Li, J., Nie, X., & Zhang, H. (2019). Flexible online multimodal hashing for large-scale multimedia retrieval. In ACM international conference on multimedia (pp. 1129–1137). Lu, X., Zhu, L., Cheng, Z., Nie, L., & Zhang, H. (2019). Online multi-modal hashing with dynamic query-adaption. In ACM SIGIR conference on research and development in information retrieval (pp. 715–724). Luo, Y., Yang, Y., Shen, F., Huang, Z., Zhou, P., & Shen, H. T. (2018). Robust discrete code modeling for supervised hashing. Pattern Recognition, 75, 128–135. Luo, X., Zhang, P., Wu, Y., Zhenduo, C., Huang, H., & Xu, X. (2018). Asymmetric discrete cross-modal hashing. In ACM international conference on multimedia retrieval (pp. 204–212). Ma, D., Liang, J., Kong, X., & He, R. (2016). Frustratingly easy cross-modal hashing. In ACM international conference on multimedia (pp. 237–241). Pan, Y., Yao, T., Li, H., Ngo, C. W., & Mei, T. (2015). Semi-supervised hashing with semantic confidence for large scale visual search. In ACM special interest group on information retrieval (pp. 53–62). Peng, Y., Huang, X., & Qi, J. (2016). Cross-media shared representation by hierarchical learning with multiple deep networks. In International joint conference on artificial intelligence (pp. 3846–3853). Peng, Y., Qi, J., & Huang, X. (2018). CCL: cross-modal correlation learning with multi-grained fusion by hierarchical network. IEEE Transactions on Multimedia, 20(2), 405–420. Rasiwasia, N., Costa Pereira, J., Coviello, E., Doyle, G., Lanckriet, G. R., Levy, R., & Vasconcelos, N. (2010). A new approach to cross-modal multimedia retrieval. In ACM multimedia (pp. 251–260). Rastegari, M., Choi, J., Fakhraei, S., Daumé Iii, H., & Davis, L. S. (2013). Predictable dual-view hashing. In International conference on international conference on machine learning (pp. 1328–1336). Shen, F., Shen, C., Liu, W., & Shen, H. T. (2015). Supervised discrete hashing. In IEEE conference on computer vision and pattern recognition (pp. 37–45). Shen, X., Shen, F., Sun, Q. S., Yang, Y., Yuan, Y. H., & Shen, H. T. (2017). Semi-paired discrete hashing: Learning latent hash codes for semi-paired cross-view retrieval. IEEE Transactions on Cybernetics, 47(12), 4275–4288. Shi, Q., Sun, H., Lu, S., Hong, M., & Razaviyayn, M. (2017). Inexact block coordinate descent methods for symmetric nonnegative matrix factorization. IEEE Transactions on Signal Processing, 65(22), 5995–6008. Shi, X., Xing, F., Xu, K., Sapkota, M., & Yang, L. (2017). Asymmetric discrete graph hashing. In AAAI conference on artificial intelligence (pp. 2541–2547). Song, J., Yang, Y., Yang, Y., Huang, Z., & Shen, H. T. (2013). Inter-media hashing for large-scale retrieval from heterogenous data sources. In ACM international conference on management of data (pp. 785–796). Tang, J., Wang, K., & Shao, L. (2016). Supervised matrix factorization hashing for cross-modal retrieval. IEEE Transactions on Image Processing, 25(7), 3157–3166. Wang, D., Gao, X. B., Wang, X., & He, L. (2018). 
Label consistent matrix factorization hashing for large-scale cross-modal similarity search. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99), 1.
Wang, D., Gao, X., Wang, X., He, L., & Yuan, B. (2016). Multimodal discriminative binary embedding for large-scale cross-modal retrieval. IEEE Transactions on Image Processing, 25(10), 4540–4554. Wang, J., Kumar, S., & Chang, S. F. (2012). Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12), 2393–2406. Wang, B., Yang, Y., Xu, X., Hanjalic, A., & Shen, H. T. (2017). Adversarial cross-modal retrieval. In ACM on multimedia conference (pp. 154–162). Weiss, Y., Torralba, A., & Fergus, R. (2009). Spectral hashing. In The annual conference on neural information processing systems (pp. 1753–1760). Wu, Y., Luo, X., Xu, X., Guo, S., & Shi, Y. (2018). Dictionary learning based supervised discrete hashing for cross-media retrieval. In ACM international conference on multimedia retrieval (pp. 222–230). Wu, B., Yang, Q., Zheng, W., Wang, Y., & Wang, J. (2015). Quantized correlation hashing for fast cross-modal search. In International joint conference on artificial intelligence (pp. 25–31). Wu, Z., & Zou, M. (2014). An incremental community detection method for social tagging systems using locality-sensitive hashing. Neural Networks, 58(10), 14–28. Xia, R., Pan, Y., Lai, H., Liu, C., & Yan, S. (2014). Supervised hashing for image retrieval via image representation learning. In AAAI conference on artificial intelligence (pp. 2156–2162). Yao, T., Kong, X., Fu, H., & Qi, T. (2017). Supervised coarse-to-fine semantic hashing for cross-media retrieval. Digital Signal Processing, 63, 135–144. Yao, T., Kong, X., Fu, H., & Tian, Q. (2019). Discrete semantic alignment hashing for cross-media retrieval. IEEE Transactions on Cybernetics, http://dx.doi.org/ 10.1109/TCYB.2019.2912644. Yao, T., Wang, G., Yan, L., Kong, X., Su, Q., Zhang, C., & Tian, Q. (2019). Online latent semantic hashing for cross-media retrieval. Pattern Recognition, 89, 1–11. Yao, T., Zhang, Z., Yan, L., Yue, J., & Tian, Q. (2019). Discrete robust supervised hashing for cross-modal retrieval. IEEE Access, 7, 39806–39814. Yunchao, G., Svetlana, L., Albert, G., & Florent, P. (2013). Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12), 2916–2929. Zhang, D., & Li, W.-J. (2014). Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization. In AAAI conference on artificial intelligence (pp. 2177–2183). Zhen, Y., & Yeung, D.-Y. (2012). Co-regularized hashing for multimodal data. In The annual conference on neural information processing systems (pp. 1376–1384). Zheng, C., Zhu, L., Lu, X., Li, J., Cheng, Z., & Zhang, H. (2019). Fast discrete collaborative multi-modal hashing for large-scale multimedia retrieval. IEEE Transactions on Knowledge and Data Engineering. Zhou, J., Ding, G., & Guo, Y. (2014). Latent semantic sparse hashing for crossmodal similarity search. In ACM special interest group on information retrieval (pp. 415–424). Zhu, L., Huang, Z., Li, Z., Xie, L., & Shen, H. T. (2018). Exploring auxiliary context: discrete semantic transfer hashing for scalable image retrieval. IEEE Transactions on Neural Networks and Learning Systems, 29(11), 5264–5276.