Information Processing and Management xxx (xxxx) xxxx
Contents lists available at ScienceDirect
Information Processing and Management journal homepage: www.elsevier.com/locate/infoproman
Label consistent locally linear embedding based cross-modal hashing☆
Hui Zeng a, Huaxiang Zhang a,b,⁎, Lei Zhu a,b
a School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong Province 250014, China
b Institute of Data Science and Technology, Shandong Normal University, Jinan, Shandong Province 250014, China
ARTICLE INFO
Keywords: Cross-modal retrieval; Discrete optimization; Hashing
ABSTRACT
Hashing methods have gained widespread attention in cross-modal retrieval applications due to their efficiency and effectiveness. Many such methods have been proposed, but they fail to capture either the feature-based similarity consistency or the discriminative semantics of label consistency. In addition, most of them suffer from large quantization loss, resulting in low retrieval performance. To address these issues, we propose a novel cross-modal hashing method named Label Consistent Locally Linear Embedding based Cross-modal Hashing (LCLCH). LCLCH preserves the non-linear manifold structure of the data of each modality by Locally Linear Embedding, and transforms heterogeneous data into a latent common semantic space to reduce the semantic gap and support cross-modal retrieval tasks. It therefore not only discovers the potential correlation of heterogeneous cross-modal data but also maintains label consistency. To further ensure the effectiveness of hash code learning, we use an iterative quantization method to handle the discrete optimization task and obtain the hash codes directly. We compare LCLCH with several advanced supervised and unsupervised methods on three datasets to evaluate its effectiveness.
1. Introduction
Large-scale heterogeneous multi-modal data have proliferated in various fields during the last decade. This trend increases the demand for techniques that search across data of different modalities. Although relevant data from different modalities are similar in semantics, they are heterogeneous in representation. Many methods (Dourado, Pedronette, & da Silva Torres, 2019; Fang, Tuan, Hui, & Wu, 2018; Lashkari, Bagheri, & Ghorbani, 2019; Yang, He, & Xu, 2017) have been proposed for information retrieval under specific conditions, but they are difficult to apply to this demand. Cross-modal retrieval, which retrieves heterogeneous data across modalities, has recently attracted considerable attention. In general, a user can use data of one modality to search for semantically relevant items of another modality. Data from different modalities usually lie in different feature spaces, and this heterogeneity is the biggest challenge in cross-modal retrieval. To facilitate cross-modal retrieval, many promising subspace-based methods (Liang, Ma, Li, Huang, & Qi, 2017; Wang, He, Wang, Wang, & Tan, 2016; Xu et al., 2018) have been proposed. However, large-scale cross-modal retrieval is still a challenging task due to the rapid growth of multi-modal data. With the increase of high-dimensional data, hash-based cross-modal methods have gradually attracted attention for their fast query speed and low storage cost. Cross-modal hashing methods can compress high-dimensional data into fixed-length hash codes,
⁎ Corresponding author. E-mail addresses: [email protected], [email protected] (H. Zhang).
https://doi.org/10.1016/j.ipm.2019.102136 Received 31 May 2019; Received in revised form 30 September 2019; Accepted 30 September 2019 0306-4573/ © 2019 Elsevier Ltd. All rights reserved.
Please cite this article as: Hui Zeng, Huaxiang Zhang and Lei Zhu, Information Processing and Management, https://doi.org/10.1016/j.ipm.2019.102136
whereas semantically relevant samples share similar binary codes. Heterogeneous data can thereby be projected into a common Hamming space and compared with each other directly. Recently, many hashing methods for cross-modal retrieval have been proposed. According to whether supervised information (usually in the form of semantic labels) is used, they roughly fall into two categories: unsupervised and supervised. Unsupervised methods mainly mine the inter-modal and intra-modal similarities to learn binary codes (or hash functions) without exploiting semantic information. Representative unsupervised hashing methods include (Ding, Guo, & Zhou, 2014; Liang, Lei, & Chen, 2016; Long, Cao, Wang, & Yu, 2016; Zhou, Ding, & Guo, 2014). Unlike unsupervised ones, supervised methods incorporate semantic labels to train the model and preserve semantic information when learning hash codes. Representative supervised methods include (Lin, Ding, Hu, & Wang, 2015; Liu, Hu, Ling, & Cheung, 2018; Lu, Zhu, Cheng, Song, & Zhang, 2019; Tang, Wang, & Shao, 2016; Xu, Shen, Yang, Shen, & Li, 2017; Yao, Kong, Yan, Tang, & Tian, 2019). Some deep hashing methods (Chen, Srivastava, Duan, & Xu, 2017; He, Xiang, Kang, Wang, & Pan, 2016; Jiang & Li, 2017; Shang, Zhang, Zhu, & Sun, 2019) have also been proposed recently. They employ deep networks to retain global relationships between sample pairs and generate unified hash codes. However, obtaining suitable training parameters for deep learning models is time consuming.
Although great progress has been made in supervised cross-modal hashing, several issues remain. (1) Most methods convert semantic labels into pairwise similarity matrices. For one thing, constructing similarity matrices is time consuming. For another, instances that belong to one category share the same discriminative semantic attributes, which distinguish them from instances in other categories, and pairwise similarity discards this category information. These methods therefore neglect the discriminative capability reflected by the label categories, which reduces retrieval accuracy. (2) Few methods explore the direct proximity relationship between neighbors, which is important for maintaining feature-based similarity consistency. (3) Most methods obtain hash codes by discarding the discrete constraints and applying simple thresholding, which results in large quantization loss and inaccurate hash codes. Some methods instead use discrete Cyclic Coordinate Descent for bit-wise optimization; although this reduces the quantization loss to some extent, it increases the time cost.
To address these challenges, we present a novel cross-modal hashing method named Label Consistent Locally Linear Embedding based Cross-modal Hashing (LCLCH). It exploits Locally Linear Embedding to maintain the manifold structure of the original data and transforms the cross-modal features into a latent common space (the semantic space). Thus, LCLCH makes cross-modal samples with the same semantic information share the same representation in the latent semantic space. Finally, the unified binary representations are obtained directly by an iterative quantization method. Compared with other works, the contributions of this work are as follows:
• A novel discrete supervised hashing method is proposed for cross-modal retrieval. We use semantic labels to guide the hashing learning process and make full use of the supervised information. In this way, the method maintains label consistency to ensure the effectiveness of hash code learning.
• We maintain the non-linear manifold structure of the data when learning hash codes to effectively construct the proximity relationship between neighbors. This captures the feature-based similarity consistency of heterogeneous cross-modal data and enhances the discriminative capability of the hash codes.
• Instead of relaxing the discrete constraints or learning the hash codes bit by bit, we generate the discrete binary codes directly by an iterative quantization method. LCLCH can therefore avoid large quantization error and make the optimization process more efficient.
The rest of the paper is organized as follows. Related work on manifold learning, quantization in hashing and cross-modal hashing is briefly described in Section 2. We present the proposed method in detail in Section 3, and Section 4 reports the settings and the experimental results. Finally, Section 5 summarizes the work.
2. Related work
2.1. Manifold learning in hashing
Locally Linear Embedding, a typical manifold learning method, performs well in dimensionality reduction. Recent studies have shown that maintaining the original manifold structure when learning hash codes is effective for improving retrieval performance, and several hash-based methods using manifold learning have been proposed. Spectral Hashing (Weiss, Torralba, & Fergus, 2008) uses the eigenvectors corresponding to the k smallest eigenvalues of the Laplacian matrix, and obtains binary codes by simple thresholding. Inspired by Spectral Hashing, Anchor Graph Hashing (Liu, Wang, Kumar, & Chang, 2011) replaces adjacency matrices with anchor graph matrices, which reduces the time complexity and improves general applicability. Other hash-based methods that exploit manifold learning include Inductive Manifold Hashing (Shen, Shen, Shi, Van Den Hengel, & Tang, 2013) and Discrete Multi-View Hashing (Yang, Shi, & Xu, 2017). More recently, manifold learning has been used in the cross-modal field. For example, Inter-Media Hashing (Song, Yang, Yang, Huang, & Shen, 2013) explores the correlation between samples from different media and addresses scalability issues. Supervised Matrix Factorization Hashing (Tang et al., 2016) uses graph Laplacian terms to maintain the local geometric manifold structure and semantic relations based on collective matrix factorization. Hetero-Manifold Regularization (Zheng, Tang, & Shao, 2018) and other graph-based cross-modal hashing methods have also been proposed.
Our method is different from the cross-modal hashing methods mentioned above. On the one hand, we fully utilize the label information to adapt to multi-label
data and the cross-modal retrieval task. On the other hand, instead of using Laplacian Eigenmaps to maintain the manifold structure, we directly use Locally Linear Embedding, which better preserves local neighbor information.
2.2. Quantization in hashing
In hashing methods, obtaining binary codes requires a quantization process. Hypercube Quantization (Gong, Lazebnik, Gordo, & Perronnin, 2013) is commonly used to generate hash codes; it quantizes data points to the vertices of a hypercube. When the quantization centers are fixed at 1 or −1, the quantization is Hypercube Quantization, and the problem can be written as:
$$\min \|x_i - y_i\|_2^2 \tag{1}$$
where $y_i \in \{-1, 1\}$. Typical methods are Iterative Quantization (Gong et al., 2013), Isotropic Hashing (Kong & Li, 2012), Harmonious Hashing (Xu et al., 2013) and Angular Quantization (Gong, Kumar, Verma, & Lazebnik, 2012). Taking Iterative Quantization as an example, it reduces the dimensionality of the original data by principal component analysis, and then maps the data to the vertices of a hypercube by solving for the projection matrix with the smallest quantization error. Recently, some clustering and classification methods (Nie, Tian, & Li, 2018) have introduced this kind of quantization, and hashing methods for single-modal retrieval use it to achieve better performance. Some quantization-based methods have also been proposed for cross-modal retrieval. Shared Predictive Cross-Modal Deep Quantization (Yang et al., 2018) learns the quantizer in a common subspace by semantic label alignment. In contrast, we adopt orthogonal rotation quantization on the common semantic space to reduce time consumption. Cross-modal and multi-modal retrieval methods rarely take quantization algorithms into account, so quantization is a worthwhile consideration in the field of cross-modal hashing.
2.3. Cross-modal hashing
Cross-modal hashing methods offer satisfactory retrieval efficiency and low storage cost when dealing with large-scale data. They can be divided into supervised and unsupervised ones according to whether label information is used during training. Unsupervised cross-modal hashing methods explore the intra- and inter-modal similarity to learn the hash codes. For instance, Inter-Media Hashing (IMH) (Song et al., 2013) uses a linear regression model to learn hashing functions for each media type and introduces inter-media consistency and intra-media consistency to find a common Hamming space.
Both Collective Matrix Factorization Hashing (CMFH) (Ding et al., 2014) and Cluster-based Joint Matrix Factorization Hashing (J-CMFH) (Rafailidis & Crestani, 2016) use a matrix factorization model to capture the latent structure of the data and learn unified hash codes in a common latent space. In the unified latent space, J-CMFH learns cluster representations for cross-modal instances and captures the inter-modality and intra-modality similarities. Unsupervised Semantic-Preserving Adversarial Hashing (USePAH) (Deng et al., 2019) designs a generative adversarial framework, constructs feature similarity and neighbor similarity to guide the learning process, and learns the hash codes in an unsupervised manner. Latent Semantic Sparse Hashing (LSSH) (Zhou et al., 2014) learns the latent factors by sparse coding of the image modality and matrix factorization of the text modality, respectively. Although these methods can explore the correlation of heterogeneous data, the learned hash codes are not discriminative enough and semantic similarity is not well preserved in Hamming space. Their performance is therefore limited in real-world applications.
Supervised methods use label information to train the model. Generally speaking, since they use semantic labels to mitigate the semantic gap and make the learned hash codes more semantically discriminative, supervised methods can achieve better retrieval performance than unsupervised ones. A series of representative supervised cross-modal hashing methods have been proposed. For instance, Semantic Correlation Maximization (SCM) (Zhang & Li, 2014) maximizes the semantic correlation to learn the hash codes by using the semantic information. Based on CMFH (Ding et al., 2014), Supervised Matrix Factorization Hashing (SMFH) (Tang et al., 2016) maintains semantic similarity in Hamming space by constraining the hash function with semantic label information.
Semantic Preserving Hashing (SePH) (Lin et al., 2015) constructs the affinity matrix from supervised information and learns hash codes by using K-L divergence to approximate the affinity matrix. Pairwise Relationship Guided Deep Hashing (PRGDH) (Yang, Deng et al., 2017) uses different pairwise constraints for inter-modal and intra-modal data and generates discriminative hash codes with an end-to-end deep network. It also uses the semantic similarity matrix in the hash code learning process. Generalized Semantic Preserving Hashing (GSePH) (Mandal, Chaudhury, & Biswas, 2017) constructs the similarity matrix in single-label paired, single-label unpaired, multi-label paired and multi-label unpaired scenarios and learns hash codes by minimizing the similarity matrix and Hamming distance. Discrete Cross-modal Hashing (DCH) (Xu et al., 2017) treats semantic labels as classification targets and learns the hash codes by discrete bit-wise optimization. These supervised methods are representative and have achieved good results. It is worth noting, however, that most of them treat semantic labels as pairwise similarities and neglect the category information.
3. Our method
In this section, we describe the proposed method in detail, and show the whole framework in Fig. 1.
Fig. 1. The whole framework of our proposed method.
3.1. Notation
Suppose the number of modalities is $M$ and the number of samples is $n$. The training samples of the $m$th modality are denoted by $X^m = [x_1^m, x_2^m, x_3^m, \dots, x_n^m] \in \mathbb{R}^{d_m \times n}$, where $m \in \{1, 2, \dots, M\}$ indexes the modality and $d_m$ is the feature dimension of the $m$th modality data. The label matrix is represented as $Y \in \{0, 1\}^{c \times n}$, where $c$ is the number of categories. In addition, $B \in \{-1, 1\}^{l \times n}$ denotes the hash code matrix, where $l$ is the length of the hash codes. $\mathrm{sign}(\cdot)$ is the element-wise sign function, which returns 1 for positive real values and −1 for negative ones.
3.2. Proposed method
3.2.1. Locally linear embedding
We follow the approach of Roweis and Saul (2000), which linearly reconstructs each sample from its $k$ nearest neighbors. Locally Linear Embedding (LLE) assumes that each data point can be represented by a linear combination of its $k$ nearest neighbors. This can be expressed as follows:
$$\min_{w^m} \sum_{i=1}^{n} \Big\| x_i^m - \sum_{j=1}^{k} w_{ij}^m x_{ij}^m \Big\|_2^2, \quad \text{s.t. } \sum_{j} w_{ij}^m = 1 \tag{2}$$
We set the number of neighbors to $k$, and $w_{ij}^m$ is the weighting factor of the linear relationship. If sample $x_j$ is not in the neighborhood of sample $x_i$, we set the corresponding $w_{ij} = 0$. Eq. (2) can be solved by the Lagrange multiplier method, and LLE can then be expressed by the following formula:
$$\min_{F} \sum_{m=1}^{2} \alpha_m \, \mathrm{tr}(F A_m F^T), \quad A_m = (I - W_m)^T (I - W_m), \quad \text{s.t. } FF^T = nI \tag{3}$$
Here, $F$ represents the common subspace after LLE, and $W_m$ is the matrix with elements $w_{ij}^m$. The index $m$ denotes the modality: $m = 1$ represents the image modality and $m = 2$ the text modality. $\alpha_m$ is a parameter controlling the manifold structure preserving term of the $m$th modality.
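The construction of $W_m$ and $A_m$ can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lle_weights` and `lle_matrix`, the default `k`, and the small regularizer `reg` (a common practical safeguard when the local Gram matrix is singular) are all hypothetical and do not come from the paper. Samples are stored as columns, matching the notation $X^m \in \mathbb{R}^{d_m \times n}$.

```python
import numpy as np

def lle_weights(X, k=5, reg=1e-3):
    """Reconstruction weights W (n x n) for samples stored as columns of X (d x n).

    Row i holds the weights w_ij of Eq. (2); entries outside the k nearest
    neighbours of sample i stay zero.
    """
    n = X.shape[1]
    W = np.zeros((n, n))
    # Pairwise squared Euclidean distances between columns.
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    for i in range(n):
        # k nearest neighbours of sample i, excluding i itself.
        order = np.argsort(dist[i])
        nbrs = order[order != i][:k]
        # Local Gram matrix of the neighbours centred on x_i.
        D = X[:, nbrs] - X[:, [i]]
        G = D.T @ D
        # Assumed regularisation so that G is invertible (e.g. when k > d).
        G = G + reg * (np.trace(G) + 1e-12) * np.eye(len(nbrs))
        w = np.linalg.solve(G, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()  # enforce sum_j w_ij = 1
    return W

def lle_matrix(W):
    """A = (I - W)^T (I - W), the matrix that appears in Eq. (3)."""
    M = np.eye(W.shape[0]) - W
    return M.T @ M
```

With this row convention, $\mathrm{tr}(F A_m F^T)$ equals the total error of reconstructing each column of $F$ from its neighbours with the same weights, which is the quantity Eq. (3) penalizes.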
3.2.2. Label consistency preserving embedding
In order to maintain semantic consistency in the common real-valued space, we use the semantic space as the common space. We set the projection matrix as $Z \in \mathbb{R}^{l \times c}$, where $l$ is the length of the hash codes and $c$ is the number of label categories. The common real-valued space $V$ is expressed by the following formula:
$$V = ZY \tag{4}$$
Besides, in order to strengthen the correlation between the semantic space and the locally linear embedding space, $\|F - ZY\|_F^2$ is added to the objective function. We want to learn hash codes and hash functions simultaneously, so we project the kernel features into the semantic space:
$$\sum_{m=1}^{2} \big( \|P_m \phi(X^m) - ZY\|_F^2 \big) \tag{5}$$
where $P_m$ is the projection matrix and $\phi(\cdot)$ is the RBF kernel function, given by $\phi_i(x) = \exp\!\big(-\frac{\|x - a_i\|_2^2}{2\sigma^2}\big)$. Here, $\{a_j\}_{j=1}^{q}$ are anchor points randomly selected from the training points, $q$ is the number of anchor points, and $\sigma = \frac{1}{nq} \sum_{i=1}^{n} \sum_{j=1}^{q} \|x_i - a_j\|_2^2$ is the width parameter. In this work, we take the image modality as the first modality and the text modality as the second, so $P_1$ and $P_2$ are the projections of the image and text modalities. The overall label consistency embedding is:
$$\min_{F, Z, P_m} \|F - ZY\|_F^2 + \lambda \sum_{m=1}^{2} \big( \|P_m \phi(X^m) - ZY\|_F^2 \big), \quad \text{s.t. } FF^T = nI \tag{6}$$
where $\lambda$ is the balance parameter of the hash function learning term.
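The kernel feature map $\phi(\cdot)$ described above can be sketched as follows. This is an illustrative sketch only: the function name `rbf_features` is hypothetical, and the width is computed as the mean squared sample-to-anchor distance, which is an assumption based on how the width formula reads in the text. The width is returned so that the same value can be reused for out-of-sample queries.

```python
import numpy as np

def rbf_features(X, anchors, sigma=None):
    """Kernel features phi(X) of shape (q, n).

    X is (d, n) with samples as columns; anchors is (d, q).
    Entry (j, i) is exp(-||x_i - a_j||^2 / (2 sigma^2)).
    """
    # Squared distances between every anchor and every sample.
    d2 = (np.sum(anchors ** 2, axis=0)[:, None]
          + np.sum(X ** 2, axis=0)[None, :]
          - 2.0 * (anchors.T @ X))
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off
    if sigma is None:
        # Assumed width: mean over all n*q sample-anchor pairs.
        sigma = d2.mean()
    return np.exp(-d2 / (2.0 * sigma ** 2)), sigma
```

A typical call would draw the anchors from the training set, for example `anchors = X[:, rng.choice(n, q, replace=False)]`, and pass the returned `sigma` back in when mapping query samples.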
3.2.3. Quantization loss
Inspired by Iterative Quantization (Gong et al., 2013), we perform orthogonal rotation quantization on the real values, which can give the binary codes better discriminative capability:
$$\min_{B, Z, R} \|B - RZY\|_F^2, \quad \text{s.t. } RR^T = I,\; B \in \{-1, 1\}^{l \times n} \tag{7}$$
where $R$ is the rotation matrix and $I$ is an identity matrix.
3.2.4. Overall objective function
Combining the above analysis, we get the final objective function:
$$L = \min_{F, Z, B, R, P_m} \sum_{m=1}^{2} \alpha_m \, \mathrm{tr}(F A_m F^T) + \|F - ZY\|_F^2 + \lambda \sum_{m=1}^{2} \|P_m \phi(X^m) - ZY\|_F^2 + \theta \|B - RZY\|_F^2 + \gamma \|P_m\|_F^2, \quad \text{s.t. } FF^T = nI,\; RR^T = I,\; B \in \{-1, 1\}^{l \times n} \tag{8}$$
where $\|P_m\|_F^2$ is the regularization term, and $\alpha$, $\lambda$, $\theta$ and $\gamma$ are the tradeoff parameters.
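For monitoring convergence, the value of the overall objective in Eq. (8) can be evaluated directly. The sketch below is an assumption-laden illustration: the function name `lclch_objective` and its argument layout are hypothetical, the per-modality quantities are passed as Python lists, and the regularization term is assumed to be summed over the modalities.

```python
import numpy as np

def lclch_objective(F, Z, B, R, P, Phi, Y, A, alpha, lam, theta, gamma):
    """Value of the objective in Eq. (8), ignoring the constraints.

    P, Phi, A and alpha are lists with one entry per modality:
    P[m] is (l, q), Phi[m] is (q, n), A[m] is (n, n), alpha[m] is a scalar.
    """
    ZY = Z @ Y
    # Terms shared by all modalities.
    value = np.sum((F - ZY) ** 2) + theta * np.sum((B - R @ ZY) ** 2)
    for Pm, Phim, Am, am in zip(P, Phi, A, alpha):
        value += am * np.trace(F @ Am @ F.T)          # manifold term
        value += lam * np.sum((Pm @ Phim - ZY) ** 2)  # hash function term
        value += gamma * np.sum(Pm ** 2)              # regularization (assumed summed over m)
    return value
```

Evaluating this after every round of updates gives a simple stopping criterion, such as halting when the relative change falls below a chosen tolerance.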
3.3. Optimization
In this subsection, we introduce the optimization strategy of LCLCH. The optimization problem over all variables $F$, $Z$, $B$, $R$ and $P_m$ jointly is non-convex. Fortunately, each subproblem has a closed-form solution when one of the five variables is solved with the others fixed. Hence, we adopt an iterative solution, detailed below.
Solving $P_m$ by fixing $F$, $Z$, $B$, $R$. The $P_m$-related loss term is expressed as:
$$\min_{P_m} \lambda \|P_m \phi(X^m) - ZY\|_F^2 + \gamma \|P_m\|_F^2 \tag{9}$$
Setting the derivative of Eq. (9) with respect to $P_m$ to zero, we obtain:
$$2\lambda P_m \phi(X^m) \phi^T(X^m) - 2\lambda ZY \phi^T(X^m) + 2\gamma P_m I = 0 \tag{10}$$
Then, the closed-form solution of $P_m$ can be obtained:
$$P_m = ZY \phi^T(X^m) \Big( \phi(X^m) \phi^T(X^m) + \frac{\gamma}{\lambda} I \Big)^{-1} \tag{11}$$
Here, $I$ is an identity matrix.
Solving $F$ by fixing $Z$, $B$, $R$ and $P_m$. The problem then simplifies to the following one:
$$\min_{F} \sum_{m=1}^{2} \alpha_m \, \mathrm{tr}(F A_m F^T) + \|F - ZY\|_F^2, \quad \text{s.t. } FF^T = nI \tag{12}$$
Expanding the above formula and dropping the term that does not depend on $F$ gives:
$$\min_{F} \sum_{m=1}^{2} \alpha_m \, \mathrm{tr}(F A_m F^T) + \mathrm{tr}(FF^T - 2ZYF^T) \tag{13}$$
We then set the derivative of Eq. (13) with respect to $F$ to zero:
$$\sum_{m=1}^{2} 2\alpha_m (F A_m) + 2F - 2ZY = 0 \tag{14}$$
We obtain the closed-form solution of $F$:
$$F = ZY \Big( \sum_{m=1}^{2} \alpha_m A_m + I \Big)^{-1} \tag{15}$$
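The two closed-form updates in Eqs. (11) and (15) can be sketched in NumPy as below. The function names `update_P` and `update_F` are hypothetical, and linear systems are solved instead of forming explicit inverses, which is a numerical design choice rather than something stated in the paper. As in Eq. (15), the sketch does not enforce the constraint $FF^T = nI$.

```python
import numpy as np

def update_P(Phi, ZY, lam, gamma):
    """Eq. (11): P = ZY Phi^T (Phi Phi^T + (gamma / lam) I)^{-1}.

    Phi is (q, n) kernel features of one modality, ZY is (l, n).
    """
    q = Phi.shape[0]
    G = Phi @ Phi.T + (gamma / lam) * np.eye(q)
    # G is symmetric, so P G = ZY Phi^T is solved through its transpose.
    return np.linalg.solve(G, Phi @ ZY.T).T

def update_F(ZY, A_list, alphas):
    """Eq. (15): F = ZY (sum_m alpha_m A_m + I)^{-1}.

    Each A_m is the symmetric (n, n) LLE matrix of one modality.
    """
    n = ZY.shape[1]
    M = np.eye(n) + sum(a * A for a, A in zip(alphas, A_list))
    # M is symmetric, so F M = ZY is solved through its transpose.
    return np.linalg.solve(M, ZY.T).T
```

Both functions return a stationary point of the corresponding subproblem, which can be checked by substituting the result back into Eqs. (10) and (14).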
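Solving $Z$ by fixing $B$, $F$, $R$, $P_m$. Eq. (8) can be written as:
$$\min_{Z} \|F - ZY\|_F^2 + \lambda \sum_{m=1}^{2} \big( \|P_m \phi(X^m) - ZY\|_F^2 \big) + \theta \|B - RZY\|_F^2 \tag{16}$$
Setting the derivative of Eq. (16) with respect to $Z$ to zero, we obtain
$$2ZYY^T - 2FY^T + 2\lambda \sum_{m=1}^{2} ZYY^T - 2\lambda \sum_{m=1}^{2} P_m \phi(X^m) Y^T + 2\theta R^T R ZYY^T - 2\theta R^T B Y^T = 0 \tag{17}$$
Then, the closed-form solution of $Z$ is
$$Z = \Big( I + \lambda \sum_{m=1}^{2} I + \theta R^T R \Big)^{-1} \Big( FY^T + \lambda \sum_{m=1}^{2} P_m \phi(X^m) Y^T + \theta R^T B Y^T \Big) (YY^T)^{-1} \tag{18}$$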
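The update of $Z$ in Eq. (18) can be sketched as follows. The function name `update_Z` and the optional `ridge` term are hypothetical: the small ridge added to $YY^T$ is an assumption that guards against a singular matrix when some category has no training sample, and it can be set to zero to recover Eq. (18) exactly.

```python
import numpy as np

def update_Z(F, B, R, P_list, Phi_list, Y, lam, theta, ridge=1e-8):
    """Eq. (18) for Z of shape (l, c).

    F, B are (l, n); R is (l, l); Y is (c, n);
    P_list[m] is (l, q) and Phi_list[m] is (q, n) for each modality m.
    """
    l, c = F.shape[0], Y.shape[0]
    # Left factor: I + lam * sum_m I + theta * R^T R.
    left = (1.0 + lam * len(P_list)) * np.eye(l) + theta * (R.T @ R)
    # Middle factor: F Y^T + lam * sum_m P_m phi(X^m) Y^T + theta * R^T B Y^T.
    rhs = F @ Y.T + theta * (R.T @ B @ Y.T)
    for P, Phi in zip(P_list, Phi_list):
        rhs = rhs + lam * (P @ Phi @ Y.T)
    YYt = Y @ Y.T + ridge * np.eye(c)  # assumed safeguard, see lead-in
    # Z = left^{-1} rhs (YY^T)^{-1}; YYt is symmetric, so solve on the transpose.
    return np.linalg.solve(YYt, np.linalg.solve(left, rhs).T).T
```

Because $R$ is orthogonal, `R.T @ R` is the identity in exact arithmetic, so the left factor reduces to a scaled identity; it is kept explicit here only to mirror the derivation.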
Solving $B$ by fixing $Z$, $F$, $R$, $P_m$. We can rewrite Eq. (8) as:
$$\min_{B} \|B - RZY\|_F^2, \quad \text{s.t. } B \in \{-1, 1\}^{l \times n} \tag{19}$$
The closed-form solution of $B$ is then
$$B = \mathrm{sign}(RZY) \tag{20}$$
Solving $R$ by fixing $Z$, $F$, $B$, $P_m$. With all variables except $R$ fixed, the problem reduces to:
$$\min_{R} \|B - RZY\|_F^2, \quad \text{s.t. } RR^T = I \tag{21}$$
This is a classical Orthogonal Procrustes problem (Schönemann, 1966), which can be solved by Singular Value Decomposition (SVD). We compute $B(ZY)^T = S \Omega \hat{S}^T$, and then derive the solution of $R$ as
$$R = S \hat{S}^T \tag{22}$$
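The alternating updates of $B$ and $R$ in Eqs. (20) and (22) can be sketched as below. The function names `update_B` and `update_R` are hypothetical, and mapping exact zeros to +1 is an assumed tie-breaking rule that the text does not specify.

```python
import numpy as np

def update_B(R, ZY):
    """Eq. (20): B = sign(R ZY), with zeros mapped to +1 (assumed tie-break)."""
    B = np.sign(R @ ZY)
    B[B == 0] = 1.0
    return B

def update_R(B, ZY):
    """Eq. (22): orthogonal Procrustes solution from the SVD of B (ZY)^T."""
    # np.linalg.svd returns U, singular values, V^T with B (ZY)^T = U diag(s) V^T.
    S, _, S_hat_T = np.linalg.svd(B @ ZY.T)
    return S @ S_hat_T
```

Each of the two steps minimizes the quantization loss with the other variable fixed, so alternating them never increases $\|B - RZY\|_F^2$.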
We repeat the above updates until the objective function converges or a preset maximum number of iterations is reached. In the first round, $B$ and $R$ are computed by Eqs. (20) and (22), and the remaining variable matrices are initialized randomly.
3.4. Out of sample extension
During the training process, we learn common hash codes to represent the data of different modalities. For an out-of-sample instance $x^m$, we generate its hash code by
$$b = \mathrm{sign}(P_m \phi(x^m)) \tag{23}$$
That is, a sample from the $m$th modality is mapped to its hash code by the learned hash function projection.
4. Experimental results
The experimental settings and results are reported in this section. All experiments are conducted on a computer with an Intel Core i7-6700 CPU and 16G RAM. The details of the experiments are as follows.
4.1. Dataset
Wiki (Rasiwasia et al., 2010) consists of 2866 image-text pairs crawled from the Wikipedia website. Each image is represented by a 128-dimension feature and each text by a 10-dimension topic feature generated by a latent Dirichlet allocation (LDA) model. Each image-text pair has one of 10 category labels. We divide this dataset into a training set with 2173 samples and a test set with 693 samples.
MIRFlickr (Huiskes & Lew, 2008) includes 25,000 instances collected from the Flickr website; each image is linked with several tags and at least one of 24 provided labels. We remove image-text pairs without textual tags or labels, which leaves 16,738 image-text pairs. A 150-dimension edge histogram feature and a 500-dimension vector generated from the PCA on its index vector
represent each image and each text respectively. We select 5% of the instances as the query set and the rest as the training set.
NUS-WIDE (Chua et al., 2009) is a real-world dataset that contains 269,648 instances selected from Flickr. Each instance has an image with its associated textual tags, and each image-text pair has at least one of 81 provided labels. We select the 10 most commonly used concepts, which correspond to 186,577 instances, as the new dataset. With this setting, 1% of the instances are randomly chosen as the query set and the rest serve as the training set.
To reduce computational cost, we randomly select 5000 training instances on MIRFlickr and 10,000 training instances on NUS-WIDE to train all models.
4.2. Evaluation metrics
The most commonly used evaluation metric in the search field is mean average precision (mAP). Given a query and a list of $R$ retrieved files, the average precision (AP) of the query is defined as
$$AP = \frac{1}{N} \sum_{r=1}^{R} P(r)\, \delta(r) \tag{24}$$
where $N$ is the number of relevant instances in the retrieved results, $P(r)$ is the precision of the top $r$ retrieved instances, and $\delta(r)$ is an indicator function: $\delta(r)$ equals 1 if the $r$th retrieved instance is relevant to the query, and 0 otherwise. mAP is the mean of the AP over all queries. In addition, we plot the precision-recall curve, obtained by varying the Hamming radius, and the top-N precision curve to evaluate the experiments.
4.3. Baselines and implementation details
We compare our method with several state-of-the-art unsupervised and supervised cross-modal hashing methods. The unsupervised methods include LSSH (Zhou et al., 2014), CMFH (Ding et al., 2014), CCQ (Long et al., 2016) and FSH (Liu, Ji, Wu, Huang, & Zhang, 2017), and the supervised methods include LCMF (Mandai & Biswas, 2017), SCM-orth (Zhang & Li, 2014), SCM-seq (Zhang & Li, 2014), SePH-km (Lin et al., 2015) and DCH (Xu et al., 2017).
LSSH (Zhou et al., 2014) performs sparse coding on image samples and matrix decomposition on text samples. It then projects the representations of one modality to the other modality.
CMFH (Ding et al., 2014) uses collective matrix factorization to learn the latent representations and then generates the unified hash codes.
CCQ (Long et al., 2016) converts the features of different modalities into an isomorphic latent space through correlation-maximal mappings, and then uses composite quantization to convert the features into hash codes.
FSH (Liu et al., 2017) uses an undirected asymmetric graph to model the fusion similarities between different modalities and introduces alternating optimization to learn more effective binary codes.
LCMF (Mandai & Biswas, 2017) introduces label factorization on the basis of matrix factorization. It retains feature-based similarity and label consistency.
SCM (Zhang & Li, 2014) generates hash codes by associating similarity matrices with hash codes. It includes SCM-orth and SCM-seq.
SePH-km (Lin et al., 2015) generates a probability distribution from the affinity matrix and approximates the distribution in Hamming space by using K-L divergence.
DCH (Xu et al., 2017) uses a linear classification model to maintain semantic information effectively, and uses DCC optimization to reduce quantization loss.
In the experiments, we perform two cross-modal retrieval tasks. One task uses an image as a query to retrieve texts (ItoT). The other uses a text as a query to retrieve images (TtoI). In addition, we set $m = 1$ to represent the image modality and $m = 2$ to represent the text modality. The parameter settings are $\alpha_1 = \alpha_2 = 0.1$, $\lambda = 0.1$, $\theta = 100$ and $\gamma = 0.01$.
4.3.1. Performance comparisons and analysis
In this section, we conduct experiments to verify the effectiveness of our method. First, we compare the mAP of our method with that of the state-of-the-art cross-modal hashing methods. Next, the precision-recall curves and top-N precision curves at 32-bit and 64-bit code lengths are plotted to improve the reliability of the experiments. In addition, the robustness of the proposed method is demonstrated by parameter sensitivity experiments.
The results on the three datasets are reported in Tables 1–3. The mAP scores of our method on both retrieval tasks are higher than those of the other hashing methods. The precision-recall curves and the top-N precision curves in Figs. 2–7 also indicate that our approach achieves the best performance at different code lengths.
On Wiki, the mAP scores of our method on the ItoT task from 32 bits to 128 bits are 0.3673, 0.3812, 0.3871 and 0.3927, and the scores on the TtoI task are 0.7533, 0.7601, 0.7624 and 0.7620 respectively. The performance of LCLCH is superior to the best performance of all the other compared methods. As the number of hash bits increases, the mAP scores of LCLCH generally improve. The main reason may be that longer hash codes can retain more effective information.
We can observe that the improvement of our method on the ItoT task is larger than that on the TtoI task. It may be that LLE can find more useful content in the text modality. Compared with DCH, we get better performance because we take local neighbor information into account, and
Table 1
The mAP scores on Wiki.

| Method | ItoT 32 bits | ItoT 64 bits | ItoT 96 bits | ItoT 128 bits | TtoI 32 bits | TtoI 64 bits | TtoI 96 bits | TtoI 128 bits |
| LSSH | 0.1967 | 0.1968 | 0.1905 | 0.1881 | 0.5342 | 0.5390 | 0.5468 | 0.5471 |
| SCM-Orth | 0.1371 | 0.1261 | 0.1220 | 0.1235 | 0.1434 | 0.1376 | 0.1287 | 0.1358 |
| SCM-Seq | 0.2410 | 0.2443 | 0.2488 | 0.2488 | 0.2459 | 0.2482 | 0.2512 | 0.2518 |
| SePH-km | 0.2986 | 0.3055 | 0.3158 | 0.3172 | 0.6501 | 0.6640 | 0.6693 | 0.6723 |
| LCMF | 0.3550 | 0.3682 | 0.3698 | 0.3724 | 0.7466 | 0.7481 | 0.7524 | 0.7533 |
| FSH | 0.2587 | 0.2601 | 0.2671 | 0.2727 | 0.4343 | 0.4561 | 0.4778 | 0.4829 |
| CCQ | 0.1899 | 0.1902 | 0.1903 | 0.1910 | 0.2223 | 0.2229 | 0.2230 | 0.2237 |
| CMFH | 0.2186 | 0.2253 | 0.2267 | 0.2348 | 0.5486 | 0.5543 | 0.5653 | 0.5680 |
| DCH | 0.3365 | 0.3552 | 0.3570 | 0.3631 | 0.7322 | 0.7348 | 0.7350 | 0.7257 |
| LCLCH | 0.3673 | 0.3812 | 0.3871 | 0.3927 | 0.7533 | 0.7601 | 0.7624 | 0.7620 |
Table 2
The mAP scores on MIRFlickr.

| Method | ItoT 32 bits | ItoT 64 bits | ItoT 96 bits | ItoT 128 bits | TtoI 32 bits | TtoI 64 bits | TtoI 96 bits | TtoI 128 bits |
| LSSH | 0.5746 | 0.5774 | 0.5797 | 0.5805 | 0.5901 | 0.5917 | 0.5911 | 0.5930 |
| SCM-Orth | 0.5817 | 0.5738 | 0.5633 | 0.5672 | 0.5901 | 0.5917 | 0.5911 | 0.5930 |
| SCM-Seq | 0.6345 | 0.6354 | 0.6334 | 0.6421 | 0.5813 | 0.5736 | 0.5632 | 0.5654 |
| SePH-km | 0.6769 | 0.6846 | 0.6855 | 0.6861 | 0.7333 | 0.7394 | 0.7421 | 0.7477 |
| LCMF | 0.6941 | 0.7121 | 0.7350 | 0.7439 | 0.7680 | 0.8011 | 0.8167 | 0.8193 |
| FSH | 0.5879 | 0.5903 | 0.5955 | 0.5997 | 0.5930 | 0.5989 | 0.6029 | 0.6052 |
| CCQ | 0.5534 | 0.5589 | 0.5608 | 0.5609 | 0.5600 | 0.5602 | 0.5609 | 0.5611 |
| CMFH | 0.5842 | 0.5751 | 0.5737 | 0.5743 | 0.5903 | 0.5690 | 0.5679 | 0.5693 |
| DCH | 0.6790 | 0.6861 | 0.6934 | 0.7034 | 0.7587 | 0.7887 | 0.7914 | 0.8024 |
| LCLCH | 0.7362 | 0.7416 | 0.7460 | 0.7553 | 0.8005 | 0.8108 | 0.8170 | 0.8236 |
Table 3
The mAP scores on NUS-WIDE.

| Method | ItoT 32 bits | ItoT 64 bits | ItoT 96 bits | ItoT 128 bits | TtoI 32 bits | TtoI 64 bits | TtoI 96 bits | TtoI 128 bits |
| LSSH | 0.3892 | 0.3956 | 0.3954 | 0.3963 | 0.4146 | 0.4210 | 0.4196 | 0.4203 |
| SCM-Orth | 0.3773 | 0.3676 | 0.3531 | 0.3540 | 0.3697 | 0.3591 | 0.3501 | 0.3543 |
| SCM-Seq | 0.5425 | 0.5513 | 0.5426 | 0.5115 | 0.5077 | 0.5138 | 0.5181 | 0.5116 |
| SePH-km | 0.5317 | 0.5418 | 0.5426 | 0.5435 | 0.6219 | 0.6304 | 0.6345 | 0.6377 |
| LCMF | 0.6250 | 0.6320 | 0.6427 | 0.6493 | 0.7498 | 0.7589 | 0.7792 | 0.7850 |
| FSH | 0.4280 | 0.4523 | 0.4571 | 0.4629 | 0.4356 | 0.4607 | 0.4691 | 0.4744 |
| CCQ | 0.3189 | 0.3190 | 0.3200 | 0.3204 | 0.3175 | 0.3196 | 0.3203 | 0.3214 |
| CMFH | 0.3520 | 0.3543 | 0.3461 | 0.3432 | 0.3528 | 0.3560 | 0.3464 | 0.3440 |
| DCH | 0.5985 | 0.5886 | 0.5759 | 0.5775 | 0.7283 | 0.7274 | 0.7065 | 0.6984 |
| LCLCH | 0.6539 | 0.6589 | 0.6602 | 0.6689 | 0.7667 | 0.7860 | 0.7871 | 0.7907 |
Fig. 2. Precision-recall curves on Wiki dataset.
Fig. 3. top-N Precision curves on Wiki dataset.
Fig. 4. Precision-recall curves on MIRFlickr dataset.
Fig. 5. top-N Precision curves on MIRFlickr dataset.
Fig. 6. Precision-recall curves on NUS-WIDE dataset.
Fig. 7. top-N Precision curves on NUS-WIDE dataset.

quantization of orthogonal rotation can improve the performance of classification. The mAP scores of our method are close to those of LCMF, because both methods consider label consistency, which makes the generated hash codes more discriminative.

On MIRFlickr, the mAP scores of LCLCH reach 0.7553 on the ItoT task and 0.8236 on the TtoI task with 128 bits. DCH, LCMF and our method achieve the best performance, and our method exceeds LCMF, which has the second-best performance, at every code length, most clearly at 32 bits. In contrast, CCQ, LSSH, CMFH and FSH perform relatively poorly in all cases because they are trained without supervision, so no explicit semantic information is available to them. Methods that convert semantic labels into pairwise similarity matrices score lower on MIRFlickr than DCH, LCMF and LCLCH. When pairwise semantic similarities are introduced, the category information of the given labels is lost: a pair of samples sharing multiple labels and a pair sharing a single label receive the same entry in the similarity matrix. Consequently, methods based on similarity matrices perform relatively poorly on multi-label datasets. We instead use semantic information to construct a common latent subspace, so the model is more suitable for datasets with multi-category labels. The experimental results validate the importance of maintaining the consistency of semantic information.

On NUS-WIDE, our method outperforms all baselines on both tasks and at every code length. As the bit number increases, the performance of our method improves, which suggests that longer hash codes carry more useful information. The results on NUS-WIDE show that our method is also suitable for large-scale datasets. In addition, it is worth noting that the results of LCLCH are higher than those of LCMF, especially on the MIRFlickr and NUS-WIDE datasets. The main reason may be that orthogonal rotation reduces noise and makes the hash codes more discriminative. MIRFlickr and NUS-WIDE are multi-label datasets that provide more detailed semantic information, and the fact that LCLCH works better than LCMF on them indicates that it reduces the semantic gap more effectively. As the bit number increases, the margin of our approach over DCH widens on NUS-WIDE. A possible reason is that LCLCH solves for the discrete hash codes directly, whereas DCH adopts a bitwise solution, so DCH is affected more strongly by the bit number. This also supports the effectiveness of the discrete optimization framework. In summary, our method achieves good performance on three benchmark datasets, and the experimental results verify that embedding the non-linear manifold structure and treating the semantic space as a latent common space to keep label consistency can produce more discriminative binary codes.

4.3.2. Ablation experiments

As stated above, LCLCH mainly aims to preserve the low-dimensional manifold structure with LLE and to keep label consistency in Hamming space. Although the previous experimental results demonstrate its effectiveness, we further verify the contribution of the Locally Linear Embedding term. We denote the variant that does not use LLE as LCH-1.
The objective function of LCH-1 is defined as follows:
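$$
\begin{aligned}
L = \min_{F, Z, B, R, P_m} \; & \sum_{m=1}^{2} \lambda \left\| P_m \phi(X^{m}) - ZY \right\|_F^2 + \theta \left\| B - RZY \right\|_F^2 + \gamma \left\| P_m \right\|_F^2 \\
\text{s.t.} \; & RR^{T} = I, \quad B \in \{-1, 1\}^{l \times n}
\end{aligned}
\tag{25}
$$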
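To make the terms of the LCH-1 objective concrete, the sketch below evaluates such an objective with NumPy for given variables: a hash function fitting term, a quantization term with rotation, and a regularization term. It is only an illustration. The weight names `lam`, `theta` and `gamma`, the function name, the matrix shapes, the use of precomputed kernel features `phi_X[m]`, and the choice to sum the regularizer over both modalities are assumptions made here rather than definitions from the paper.

```python
import numpy as np


def lch1_objective(P, phi_X, Z, Y, B, R, lam=1.0, theta=1.0, gamma=1.0):
    """Value of an LCH-1-style objective (hypothetical weight names).

    P      : list of projection matrices, each of shape (l, q_m)
    phi_X  : list of kernel feature matrices, each of shape (q_m, n)
    Z      : (l, c) matrix mapping labels into the latent space
    Y      : (c, n) label matrix
    B      : (l, n) binary codes with entries in {-1, +1}
    R      : (l, l) orthogonal rotation
    """
    ZY = Z @ Y
    # Quantization loss between the codes and the rotated latent representation.
    value = theta * np.linalg.norm(B - R @ ZY, "fro") ** 2
    for P_m, phi_m in zip(P, phi_X):
        # Hash function fit for modality m.
        value += lam * np.linalg.norm(P_m @ phi_m - ZY, "fro") ** 2
        # Regularization, summed over modalities (an assumption).
        value += gamma * np.linalg.norm(P_m, "fro") ** 2
    return value


rng = np.random.default_rng(0)
l, c, n, q = 4, 3, 10, 5
Z = rng.standard_normal((l, c))
Y = rng.integers(0, 2, size=(c, n)).astype(float)
R = np.eye(l)
B = np.sign(R @ Z @ Y)
B[B == 0] = 1  # break ties so that B stays in {-1, +1}
phi_X = [rng.standard_normal((q, n)) for _ in range(2)]
P = [np.zeros((l, q)) for _ in range(2)]
value = lch1_objective(P, phi_X, Z, Y, B, R)
```

With all projections set to zero, as in the toy call, the value reduces to the quantization loss plus twice the squared Frobenius norm of ZY, which gives a quick sanity check of the implementation.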
We conduct experiments on the three datasets with 64 bits and summarize the mAP results in Table 4. Table 4 shows that LCLCH consistently outperforms LCH-1. In particular, the gains on the TtoI task are clear on MIRFlickr and NUS-WIDE. These results illustrate the effectiveness of preserving the approximate relationships between neighboring samples with LLE, and show that LCLCH keeps neighbor similarity more consistent.

In our approach, an iterative optimization framework with orthogonal rotation is adopted to directly solve the discrete hash codes. We further validate the importance of this discrete optimization framework in LCLCH. We denote by LCH-2 the variant that removes the orthogonal rotation term. Its objective function is:

$$
\begin{aligned}
L = \min_{F, Z, B, R, P_m} \; & \sum_{m=1}^{2} \alpha_m \operatorname{tr}\left(F A_m F^{T}\right) + \beta \left\| F - ZY \right\|_F^2 + \lambda \left\| P_m \phi(X^{m}) - ZY \right\|_F^2 + \theta \left\| B - ZY \right\|_F^2 + \gamma \left\| P_m \right\|_F^2 \\
\text{s.t.} \; & FF^{T} = nI, \quad B \in \{-1, 1\}^{l \times n}
\end{aligned}
\tag{26}
$$
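The role of the rotation that LCH-2 removes can be illustrated with a small NumPy experiment. The sketch below alternates a sign step for B with an orthogonal Procrustes step for R on a fixed real-valued matrix `V` (standing in for ZY) and compares the resulting quantization loss with the loss obtained without any rotation. The closed forms used here are the standard solutions of these two subproblems. They are assumed to play the role of Eqs. (20) and (22) but are not copied from them, and all names are hypothetical.

```python
import numpy as np


def sign_codes(M):
    """Elementwise sign with ties mapped to +1, so codes stay in {-1, +1}."""
    return np.where(M >= 0, 1.0, -1.0)


def procrustes_rotation(B, V):
    """Orthogonal R minimizing ||B - R V||_F (orthogonal Procrustes problem)."""
    U, _, Wt = np.linalg.svd(B @ V.T)
    return U @ Wt


def quantization_loss(B, R, V):
    """Squared Frobenius distance between the codes and the rotated data."""
    return np.linalg.norm(B - R @ V, "fro") ** 2


def rotate_and_quantize(V, n_iter=20):
    """Alternate the B-step and the R-step, starting from R = I."""
    R = np.eye(V.shape[0])
    losses = []
    for _ in range(n_iter):
        B = sign_codes(R @ V)          # best binary codes for the current R
        R = procrustes_rotation(B, V)  # best rotation for the current B
        losses.append(quantization_loss(B, R, V))
    return B, R, losses


rng = np.random.default_rng(0)
V = rng.standard_normal((8, 200))  # stand-in for the real-valued codes ZY
B, R, losses = rotate_and_quantize(V)
loss_without_rotation = quantization_loss(sign_codes(V), np.eye(8), V)
```

Because each step solves its subproblem exactly and the identity is a feasible rotation, the recorded losses never increase and never exceed `loss_without_rotation`. How much this helps retrieval in practice is an empirical question that Table 4 addresses.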
The results are listed in Table 4. The mAP scores of LCLCH are about 0.01 higher than those of LCH-2 on both the ItoT and TtoI tasks on the three datasets. The orthogonal rotation matrix helps maintain the discrete constraints during training and makes the learned hash codes more discriminative. Comparing the two variants in Table 4, removing the iterative quantization term costs more than removing the Locally Linear Embedding term on MIRFlickr, whereas the opposite holds on Wiki and NUS-WIDE. Both terms therefore contribute, and the overall discrete optimization framework is important for the effectiveness of the algorithm.

Table 4 The mAP scores of LCLCH, LCH-1 and LCH-2.

| Method | Wiki ItoT | Wiki TtoI | MIRFlickr ItoT | MIRFlickr TtoI | NUS-WIDE ItoT | NUS-WIDE TtoI |
|---|---|---|---|---|---|---|
| LCH-1 | 0.3525 | 0.7505 | 0.7342 | 0.7969 | 0.6422 | 0.7618 |
| LCH-2 | 0.3726 | 0.7518 | 0.7301 | 0.7940 | 0.6503 | 0.7769 |
| LCLCH | 0.3812 | 0.7601 | 0.7416 | 0.8108 | 0.6589 | 0.7860 |

4.3.3. Convergence study

Because LCLCH relies on iterative optimization, convergence speed is an important aspect of its performance. We therefore plot the convergence curves on the MIRFlickr and NUS-WIDE datasets with 32 bits. Fig. 8 shows that our method converges quickly, within 15 iterations on both datasets.

Fig. 8. Convergence analysis on two datasets.

4.3.4. Time cost analysis

Here we analyze the time complexity of LCLCH. The time complexities of solving Eqs. (11), (18), (20) and (22) are $O(q^3 + lcn + lnq + nq^2)$, $O(l^3 + c^3 + l^2 n + lcn + lqn + c^2 n)$, $O(l^2 n + lcn)$ and $O(l^2 n + lcn + l^3)$, respectively, where $q$ is the feature dimension after the RBF kernel mapping. The time complexity is linear in the number of training instances $n$. Since A is a sparse matrix, the calculation of Eq. (15) is also efficient. The time complexity analysis shows that LCLCH is efficient.

To examine the time consumption of LCLCH empirically, we compare training times on the MIRFlickr dataset. The experimental results are shown in Table 5. Although some basic methods such as CMFH and SCM (and CCQ at 32 bits) take less training time, their retrieval results are relatively unsatisfactory because the models are too simple. SePH and LCMF use a pairwise solution to solve the hash codes, which is time consuming. Unlike DCH, LSSH, SePH and LCMF, the training time of LCLCH changes little as the bit number increases.
This is because LCLCH uses a discrete optimization framework that directly solves the hash codes. It not only maintains the discrete constraints of the hash codes but also greatly reduces training time.

Table 5 Training time comparison on MIRFlickr (in seconds).

| Method | 32 bits | 64 bits | 96 bits | 128 bits |
|---|---|---|---|---|
| LSSH | 9.96 | 10.87 | 11.98 | 12.97 |
| SCM-orth | 0.05 | 0.05 | 0.05 | 0.05 |
| SCM-seq | 0.38 | 0.78 | 1.25 | 1.64 |
| SePH-km | 408.29 | 483.32 | 598.04 | 720.73 |
| LCMF | 258.09 | 334.48 | 414.51 | 513.43 |
| FSH | 9.59 | 10.94 | 12.61 | 12.43 |
| CCQ | 3.94 | 10.97 | 15.76 | 45.44 |
| CMFH | 0.13 | 0.15 | 0.18 | 0.19 |
| DCH | 10.21 | 32.37 | 69.81 | 115.16 |
| LCLCH | 8.75 | 8.86 | 8.92 | 9.03 |

4.3.5. Parameter analysis

In the above subsections, we fixed the parameters for all experiments. Here we conduct parameter sensitivity experiments to analyze the effect of each parameter on hash code learning. We study the influence of one parameter at a time while fixing the others. We set the hash code length to 64 bits and plot the mAP-parameter curves in Fig. 9.

Fig. 9. Parameter analysis.

θ controls the impact of the quantization term in the objective function. Fig. 9(a) shows that a θ that is too large or too small weakens the role of the orthogonal rotation quantization term in the overall objective function. We find that the fluctuation produced by θ on NUS-WIDE and MIRFlickr is weaker than on Wiki. When θ varies in the interval [1, 100], the performance of our method remains relatively stable. Thus, θ is usually chosen from [1, 100]. λ controls the effect of the hash function terms on the overall loss. Fig. 9(b) shows that, when λ is large, the results drop significantly on the three datasets, and the fluctuation on the ItoT task is greater than on the TtoI task. This shows that λ has a greater impact on the hash function learning of the text modality than on that of the image modality. The parameters $\alpha_1$ and $\alpha_2$ balance the terms that maintain the manifold structure of the different modalities. We therefore plot three-dimensional surfaces showing the impact of $\alpha_1$ and $\alpha_2$ on MIRFlickr and NUS-WIDE in Fig. 9(c–f). A possible reason why the impact of $\alpha_1$ is smaller than that of $\alpha_2$ is that Locally Linear Embedding handles data from the text modality better than data from the image modality. Usually, we select α from the range [0.01, 1].
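The one-parameter-at-a-time protocol described in this subsection can be sketched as follows. The `evaluate` callable, the default values and the grids are hypothetical placeholders. In practice `evaluate` would train the model with the given parameters and return the mAP, whereas the toy function here only mimics a curve with a single peak.

```python
def one_at_a_time_sweep(evaluate, defaults, grids):
    """Vary each parameter over its grid while holding the others at defaults.

    Returns {parameter name: [(value, score), ...]}.
    """
    curves = {}
    for name, grid in grids.items():
        curve = []
        for value in grid:
            params = dict(defaults)   # copy, so defaults are never mutated
            params[name] = value
            curve.append((value, evaluate(**params)))
        curves[name] = curve
    return curves


def toy_evaluate(theta, lam):
    """Stand-in for 'train and return mAP'; peaks at theta=10, lam=0.01."""
    return 1.0 / (1.0 + abs(theta - 10) / 10 + abs(lam - 0.01))


defaults = {"theta": 10, "lam": 0.01}
grids = {"theta": [0.01, 1, 10, 100, 1000], "lam": [0.0001, 0.01, 1]}
curves = one_at_a_time_sweep(toy_evaluate, defaults, grids)
best_theta = max(curves["theta"], key=lambda pair: pair[1])[0]
```

Each entry of `curves` corresponds to one mAP-parameter curve of the kind plotted in Fig. 9(a) and Fig. 9(b). A joint sweep over two parameters, as in Fig. 9(c–f), would need a nested loop instead.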
Input: feature matrices $X^{m}$ (m = 1, 2), label matrix Y, the length of hash codes l, parameters λ, θ, β, γ
1: Randomly initialize Z, $P_m$, F
2: Initialize R using Eq. (22)
3: Initialize B using Eq. (20)
4: Repeat
5: Update $P_m$ by using Eq. (11).
6: Update F by using Eq. (15).
7: Update Z by using Eq. (18).
8: Update R by using Eq. (22).
9: Update B by using Eq. (20).
10: until the objective function of Eq. (8) converges.
Output: unified hash codes B, projection matrices $P_m$
Algorithm 1. Description of algorithm LCLCH.
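The control flow of Algorithm 1 is a block-coordinate loop that stops when the objective stabilizes. A generic sketch of that loop is given below. The update rules of Eqs. (11), (15), (18), (20) and (22) are not reproduced, and the relative-change stopping rule, the tolerance, and all names are assumptions. A toy two-variable objective is used so that the example runs on its own.

```python
def alternate_until_converged(state, updates, objective, tol=1e-6, max_iter=100):
    """Apply each block update in order until the objective stops changing.

    state     : dict holding the current value of every variable block
    updates   : list of (block name, function state -> new value) pairs
    objective : function state -> float
    """
    history = [objective(state)]
    for _ in range(max_iter):
        for name, update in updates:
            state[name] = update(state)  # later updates see the new value
        history.append(objective(state))
        change = abs(history[-2] - history[-1])
        if change <= tol * max(1.0, abs(history[-2])):
            break
    return state, history


# Toy objective: f(x, y) = (x - 1)^2 + (y + 2)^2 + (x - y)^2.
def f(s):
    return (s["x"] - 1) ** 2 + (s["y"] + 2) ** 2 + (s["x"] - s["y"]) ** 2


updates = [
    ("x", lambda s: (1 + s["y"]) / 2),   # argmin over x with y fixed
    ("y", lambda s: (s["x"] - 2) / 2),   # argmin over y with x fixed
]
state, history = alternate_until_converged({"x": 0.0, "y": 0.0}, updates, f)
```

In an implementation of Algorithm 1, `updates` would hold the five closed-form steps in the listed order, and `history` would give a convergence curve like the ones in Fig. 8.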
5. Conclusion

In this paper, we propose a Label Consistent Locally Linear Embedding based Cross-modal Hashing method. It uses Locally Linear Embedding to maintain the manifold structure while transforming heterogeneous data into a common space. To better exploit the semantic information, we treat the semantic space as a common subspace to preserve label consistency. In addition, we directly quantize the real-valued semantic representations into discrete binary codes to reduce the quantization error. The experimental results on three standard datasets validate the effectiveness of our method. Our future work will focus on manifold structure preservation and discrete optimization for cross-modal retrieval.

Acknowledgments

The work is partially supported by the National Natural Science Foundation of China (Nos. 61772322, 61572298, U1836216) and the Key Research and Development Foundation of Shandong Province (Nos. 2017GGX10117, 2017CXGC0703).